Mastering Linux Kernel Development: A Developer's Reference Manual Packt Development
- Preface
- Comprehending Processes, Address Space, and Threads
- Processes
- Process descriptors
- Kernel stack
- The issue of stack overflow
- Process creation
- Kernel threads
- Process status and termination
- Namespaces and cgroups
- Summary
- Deciphering the Process Scheduler
- Signal Management
- Memory Management and Allocators
- Filesystems and File I/O
- Interprocess Communication
- Virtual Memory Management
- Kernel Synchronization and Locking
- Interrupts and Deferred Work
- Clock and Time Management
- Module Management
Mastering Linux Kernel Development
A kernel developer's reference manual
Raghu Bharadwaj
BIRMINGHAM - MUMBAI
Mastering Linux Kernel Development
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the author, nor Packt
Publishing, and its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of
the companies and products mentioned in this book by the appropriate use of
capitals. However, Packt Publishing cannot guarantee the accuracy of this
information.
First published: October 2017
Production reference: 1091017
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78588-305-7
www.packtpub.com
Credits
Author: Raghu Bharadwaj
Copy Editor: Madhusudan Uchil
Reviewer: Rami Rosen
Project Coordinator: Virginia Dias
Commissioning Editor: Kartikey Pandey
Proofreader: Safis Editing
Acquisition Editor: Rahul Nair
Indexer: Francy Puthiry
Content Development Editor: Sharon Raj
Graphics: Kirk D'Penha
Technical Editor: Mohit Hassija
Production Coordinator: Arvindkumar Gupta
About the Author
Raghu Bharadwaj is a leading consultant, contributor, and corporate trainer on
the Linux kernel with experience spanning close to two decades. He is an ardent
kernel enthusiast and expert, and has been closely following the Linux kernel
since the late 90s. He is the founder of TECH VEDA, which specializes in
engineering and skilling services on the Linux kernel, through technical support,
kernel contributions, and advanced training. His precise understanding and
articulation of the kernel has been a hallmark, and his penchant for software
designs and OS architectures has garnered him special mention from his clients.
Raghu is also an expert in delivering solution-oriented, customized training
programs for engineering teams working on the Linux kernel, Linux drivers, and
Embedded Linux. Some of his clients include major technology companies such
as Xilinx, GE, Canon, Fujitsu, UTC, TCS, Broadcom, Sasken, Qualcomm,
Cognizant, STMicroelectronics, Stryker, and Lattice Semiconductors.
I would first like to thank Packt for giving me this opportunity to come up with
this book. I extend my sincere regards to all the editors (Sharon and the team) at
Packt for rallying behind me and ensuring that I stay on time and in line in
delivering precise, crisp, and most up-to-date information through this book.
I would also like to thank my family, who supported me throughout my busy
schedules. Lastly, but most importantly, I would like to thank my team at TECH
VEDA, who not only supported but also contributed in their own ways through
valuable suggestions and feedback.
About the Reviewer
Rami Rosen is the author of Linux Kernel Networking - Implementation and
Theory, a book published by Apress in 2013. Rami has worked for more than 20
years in high-tech companies, starting out in three startups. Most of his
work (past and present) is around kernel and userspace networking and
virtualization projects, ranging from device drivers and kernel network stack and
DPDK to NFV and OpenStack. Occasionally, he gives talks at international
conferences and writes articles for LWN.net, the Linux Journal, and more.
I thank my wife, Yoonhwa, who allowed me to spend weekends reviewing this
book.
Table of Contents
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Errata
Piracy
Questions
1. Comprehending Processes, Address Space, and Threads
Processes
The illusion called address space
Kernel and user space
Process context
Process descriptors
Process attributes - key elements
state
pid
tgid
thread info
flags
exit_code and exit_signal
comm
ptrace
Process relations - key elements
real_parent and parent
children
sibling
group_leader
Scheduling attributes - key elements
prio and static_prio
se, rt, and dl
policy
cpus_allowed
rt_priority
Process limits - key elements
File descriptor table - key elements
fs
files
Signal descriptor - key elements
signal
sighand
sigset_t blocked, real_blocked
pending
sas_ss_sp
sas_ss_size
Kernel stack
The issue of stack overflow
Process creation
fork()
Copy-on-write (COW)
exec
vfork()
Linux support for threads
clone()
Kernel threads
do_fork() and copy_process()
Process status and termination
wait
exit
Namespaces and cgroups
Mount namespaces
UTS namespaces
IPC namespaces
PID namespaces
Network namespaces
User namespaces
Cgroup namespaces
Control groups (cgroups)
Summary
2. Deciphering the Process Scheduler
Process schedulers
Linux process scheduler design
Runqueue
The scheduler's entry point
Process priorities
Scheduler classes
Completely Fair Scheduling class (CFS)
Computing priorities and time slices under CFS
CFS's runqueue
Group scheduling
Scheduling entities under many-core systems
Scheduling policies
Real-time scheduling class
FIFO
RR
Real-time group scheduling
Deadline scheduling class (sporadic task model deadline scheduling)
Scheduler related system calls
Processor affinity calls
Process preemption
Summary
3. Signal Management
Signals
Signal-management APIs
Raising signals from a program
Waiting for queued signals
Signal data structures
Signal descriptors
Blocked and pending queues
Signal handler descriptor
Signal generation and delivery
Signal-generation calls
Signal delivery
Executing user-mode handlers
Setting up user-mode handler frames
Restarting interrupted system calls
Summary
4. Memory Management and Allocators
Initialization operations
Page descriptor
Flags
Mapping
Zones and nodes
Memory zones
Memory nodes
Node descriptor structure
Zone descriptor structure
Memory allocators
Page frame allocator
Buddy system
GFP mask
Zone modifiers
Page mobility and placement
Watermark modifiers
Page reclaim modifiers
Action modifiers
Type flags
Slab allocator
Kmalloc caches
Object caches
Cache management
Cache layout - generic
Slub data structures
Vmalloc
Contiguous Memory Allocator (CMA)
Summary
5. Filesystems and File I/O
Filesystem - high-level view
Metadata
Inode (index node)
Data block map
Directories
Superblock
Operations
Mount and unmount operations
File creation and deletion operations
File open and close operations
File read and write operations
Additional features
Extended file attributes
Filesystem consistency and crash recovery
Access control lists (ACLs)
Filesystems in the Linux kernel
Ext family filesystems
Ext2
Ext3
Ext4
Common filesystem interface
VFS structures and operations
struct superblock
struct inode
struct dentry
struct file
Special filesystems
Procfs
Sysfs
Debugfs
Summary
6. Interprocess Communication
Pipes and FIFOs
pipefs
Message queues
System V message queues
Data structures
POSIX message queues
Shared memory
System V shared memory
Operation interfaces
Allocating shared memory
Attaching a shared memory
Detaching shared memory
Data structures
POSIX shared memory
Semaphores
System V semaphores
Data structures
POSIX semaphores
Summary
7. Virtual Memory Management
Process address space
Process memory descriptor
Managing virtual memory areas
Locating a VMA
Merging VMA regions
struct address_space
Page tables
Summary
8. Kernel Synchronization and Locking
Atomic operations
Atomic integer operations
Atomic bitwise operations
Introducing exclusion locks
Spinlocks
Alternate spinlock APIs
Reader-writer spinlocks
Mutex locks
Debug checks and validations
Wait/wound mutexes
Operation interfaces:
Semaphores
Reader-writer semaphores
Sequence locks
API
Completion locks
Initialization
Waiting for completion
Signalling completion
Summary
9. Interrupts and Deferred Work
Interrupt signals and vectors
Programmable interrupt controller
Interrupt controller operations
IRQ descriptor table
High-level interrupt-management interfaces
Registering an interrupt handler
Deregistering an interrupt handler
Threaded interrupt handlers
Control interfaces
IRQ stacks
Deferred work
Softirqs
Tasklets
Workqueues
Interface API
Creating dedicated workqueues
Summary
10. Clock and Time Management
Time representation
Timing hardware
Real-time clock (RTC)
Timestamp counter (TSC)
Programmable interrupt timer (PIT)
CPU local timer
High-precision event timer (HPET)
ACPI power management timer (ACPI PMT)
Hardware abstraction
Calculating elapsed time
Linux timekeeping data structures, macros, and helper routines
Jiffies
Timeval and timespec
Tracking and maintaining time
Tick and interrupt handling
Tick devices
Software timers and delay functions
Dynamic timers
Race conditions with dynamic timers
Dynamic timer handling
Delay functions
POSIX clocks
Summary
11. Module Management
Kernel modules
Elements of an LKM
Binary layout of a LKM
Load and unload operations
Module data structures
Memory layout
Summary
Preface
Mastering Linux Kernel Development looks at the Linux kernel, its internal
arrangement and design, and various core subsystems, helping you to gain
significant understanding of this open source marvel. You will look at how the
Linux kernel, which possesses a kind of collective intelligence thanks to its
scores of contributors, remains so elegant owing to its great design.
This book also looks at all the key kernel code, core data structures, functions,
and macros, giving you a comprehensive foundation of the implementation
details of the kernel's core services and mechanisms. You will also look at the
Linux kernel as well-designed software, which gives us insights into designing
software that is easily scalable yet fundamentally strong and safe.
What this book covers
Chapter 1, Comprehending Processes, Address Space, and Threads, looks closely at one of the
principal abstractions of Linux called the process and the whole ecosystem
that facilitates this abstraction. We will also spend time understanding
address space, process creation, and threads.
Chapter 2, Deciphering the Process Scheduler, explains process scheduling, which
is a vital aspect of any operating system. Here we will build our understanding
of the different scheduling policies engaged by Linux to deliver effective process
execution.
Chapter 3, Signal Management, helps in understanding all core aspects of signal
usage, their representation, data structures, and kernel routines for signal
generation and delivery.
Chapter 4, Memory Management and Allocators, takes us through one of the
most crucial aspects of the Linux kernel, comprehending various nuances of
memory representations and allocations. We will also gauge the efficiency of the
kernel in maximizing resource usage at minimal costs.
Chapter 5, Filesystems and File I/O, imparts a generic understanding of a typical
filesystem, its fabric, design, and what makes it an elemental part of an operating
system. We will also look at abstraction, using the common, layered architecture
design, which the kernel comprehensively implements through the VFS.
Chapter 6, Interprocess Communication, touches upon the various IPC mechanisms
offered by the kernel. We will explore the layout and relationship between
various data structures for each IPC mechanism, and look at both the SysV and
POSIX IPC mechanisms.
Chapter 7, Virtual Memory Management, explains memory management with
details of virtual memory management and page tables. We will look into the
various aspects of the virtual memory subsystem such as process virtual address
space and its segments, memory descriptor structure, memory mapping and
VMA objects, page cache, and address translation with page tables.
Chapter 8, Kernel Synchronization and Locking, enables us to understand the
various protection and synchronization mechanisms provided by the kernel, and
comprehend the merits and shortcomings of these mechanisms. We will try to
appreciate the tenacity with which the kernel addresses these varying
synchronization complexities.
Chapter 9, Interrupts and Deferred Work, talks about interrupts, which are a key
facet of any operating system to get necessary and priority tasks done. We will
look at how interrupts are generated, handled, and managed in Linux. We will
also look at various bottom half mechanisms.
Chapter 10, Clock and Time Management, reveals how the kernel measures and
manages time. We will look at all key time-related structures, routines, and
macros to help us gauge time management effectively.
Chapter 11, Module Management, quickly looks at modules and the kernel's infrastructure
for managing modules, along with all the core data structures involved. This helps
us understand how the kernel supports dynamic extensibility.
What you need for this book
Apart from a deep desire to understand the nuances of the Linux kernel and its
design, you need prior understanding of the Linux operating system in general,
and the idea of open-source software, to start spending time with this book.
However, this is not binding, and anyone with a keen eye for detailed
information about the Linux system and its working can pick up this book.
Who this book is for
- This book is for system programming enthusiasts and professionals who
  would like to deepen their understanding of the Linux kernel and its various
  integral components.
- This is a handy book for developers working on various kernel-related
  projects.
- Students of software engineering can use this as a reference guide for
  comprehending various aspects of the Linux kernel and its design principles.
Conventions
In this book, you will find a number of text styles that distinguish between
different kinds of information. Here are some examples of these styles and an
explanation of their meaning. Code words in text, database table names, folder
names, filenames, file extensions, pathnames, dummy URLs, user input, and
Twitter handles are shown as follows: "In the loop() function, we read the value
of the distance from the sensor and then display it on the serial port."
A block of code is set as follows:
/* linux-4.9.10/arch/x86/include/asm/thread_info.h */
struct thread_info {
    unsigned long flags; /* low level flags */
};
New terms and important words are shown in bold. Words that you see on the
screen, for example, in menus or dialog boxes, appear in the text like this: "Go to
Sketch | Include Library | Manage Libraries and you will get a dialog."
Warnings or important notes appear like this.
Tips and tricks appear like this.
Reader feedback
Feedback from our readers is always welcome. Let us know what you think
about this book - what you liked or disliked. Reader feedback is important for us
as it helps us develop titles that you will really get the most out of. To send us
general feedback, simply email feedback@packtpub.com, and mention the book's title
in the subject of your message. If there is a topic that you have expertise in and
you are interested in either writing or contributing to a book, see our author
guide at www.packtpub.com/authors.
Customer support
Now that you are the proud owner of a Packt book, we have a number of things
to help you to get the most from your purchase.
Errata
Although we have taken every care to ensure the accuracy of our content,
mistakes do happen. If you find a mistake in one of our books - maybe a mistake
in the text or the code - we would be grateful if you could report this to us. By
doing so, you can save other readers from frustration and help us improve
subsequent versions of this book. If you find any errata, please report them by
visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the Errata
Submission Form link, and entering the details of your errata. Once your errata
are verified, your submission will be accepted and the errata will be uploaded to
our website or added to any list of existing errata under the Errata section of that
title. To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book
in the search field. The required
information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the internet is an ongoing problem across all
media. At Packt, we take the protection of our copyright and licenses very
seriously. If you come across any illegal copies of our works in any form on the
internet, please provide us with the location address or website name
immediately so that we can pursue a remedy. Please contact us at
copyright@packtpub.com with a link to the suspected pirated material. We appreciate
your help in protecting our authors and our ability to bring you valuable content.
Questions
If you have a problem with any aspect of this book, you can contact us at
questions@packtpub.com, and we will do our best to address the problem.
Comprehending Processes, Address Space, and Threads
Kernel services are invoked in the context of the current process, so the layout
of a process is a natural starting point for exploring the kernel in more detail.
Our effort in this chapter is centered around comprehending processes and the
underlying ecosystem the kernel provides for them. We will explore the following
concepts in this chapter:
- Program to process
- Process layout
- Virtual address spaces
- Kernel and user space
- Process APIs
- Process descriptors
- Kernel stack management
- Threads
- Linux thread API
- Data structures
- Namespace and cgroups
Processes
Quintessentially, computing systems are designed, developed, and often tweaked
for running user applications efficiently. Every element that goes into a
computing platform is intended to enable effective and efficient ways for
running applications. In other words, computing systems exist to run diverse
application programs. Applications can run either as firmware in dedicated
devices or as a "process" in systems driven by system software (operating
systems).
At its core, a process is a running instance of a program in memory. The
transformation from a program to a process happens when the program (on disk)
is fetched into memory for execution.
A program's binary image carries code (with all its binary instructions) and data
(with all global data), which are mapped to distinct regions of memory with
appropriate access permissions (read, write, and execute). Apart from code and
data, a process is assigned additional memory regions called stack (for
allocation of function call frames with auto variables and function arguments)
and heap for dynamic allocations at runtime.
Multiple instances of the same program can exist with their respective memory
allocations. For instance, in a web browser that runs each open tab as a
separate process (for simultaneous browsing sessions), the kernel sees every tab
as a distinct process instance with its own memory allocations.
The following figure represents the layout of processes in memory:
The illusion called address space
Modern-day computing platforms are expected to handle a plethora of processes
efficiently. Operating systems thus must deal with allocating unique memory to
all contending processes within the physical memory (often finite) and also
ensure their reliable execution. With multiple processes contending and
executing simultaneously (multi-tasking), the operating system must ensure that
the memory allocation of every process is protected from accidental access by
another process.
To address this issue, the kernel provides a level of abstraction between the
process and the physical memory called virtual address space. Virtual address
space is the process's view of memory; it is how the running program views the
memory.
Virtual address space creates an illusion that every process exclusively owns the
whole memory while executing. This abstracted view of memory is called
virtual memory and is achieved by the kernel's memory manager in coordination
with the CPU's MMU. Each process is given a contiguous 32-bit or 64-bit address
space, bound by the architecture and unique to that process. With each process
caged into its virtual address space by the MMU, any attempt by a process to
access an address region outside its boundaries will trigger a hardware fault,
making it possible for the memory manager to detect and terminate violating
processes, thus ensuring protection.
The following figure depicts the illusion of address space created for every
contending process:
Kernel and user space
Modern operating systems not only prevent one process from accessing another
but also prevent processes from accidentally accessing or manipulating kernel
data and services (as the kernel is shared by all the processes).
Operating systems achieve this protection by segmenting the whole memory into
two logical halves, the user and kernel space. This bifurcation ensures that all
processes that are assigned address spaces are mapped to the user space section
of memory, while kernel data and services run in kernel space. The kernel achieves
this protection in coordination with the hardware. While an application process
is executing instructions from its code segment, the CPU is operating in user
mode. When a process intends to invoke a kernel service, it needs to switch the
CPU into privileged mode (kernel mode), which is achieved through library
functions called APIs (application programming interfaces). These APIs enable
user processes to switch into the kernel space using special CPU instructions and
then execute the required services through system calls. On completion of the
requested service, the kernel executes another mode switch, this time back from
kernel mode to user mode, using another set of CPU instructions.
System calls are the kernel's interfaces to expose its services to
application processes; they are also called kernel entry points. As
system calls are implemented in kernel space, user-space programs
reach them through wrapper functions provided as APIs. The API
abstraction also makes it easier and more convenient to invoke related
system calls.
The following figure depicts a virtualized memory view:
Process context
When a process requests a kernel service through a system call, the kernel will
execute on behalf of the caller process. The kernel is now said to be executing in
process context. Similarly, the kernel also responds to interrupts raised by other
hardware entities; here, the kernel executes in interrupt context. When in
interrupt context, the kernel is not running on behalf of any process.
Process descriptors
Right from the time a process is born until it exits, it's the kernel's process
management subsystem that carries out various operations, ranging from process
creation, allocating CPU time, and event notifications to destruction of the
process upon termination.
Apart from the address space, a process in memory is also assigned a data
structure called the process descriptor, which the kernel uses to identify,
manage, and schedule the process. The following figure depicts process address
spaces with their respective process descriptors in the kernel:
In Linux, a process descriptor is an instance of type struct task_struct, defined in
<linux/sched.h>. It is one of the central data structures, and contains all the
attributes, identification details, and resource allocation entries that a process
holds. Looking at struct task_struct is like a peek into the window of what the
kernel sees or works with to manage and schedule a process.
Since the task structure contains a wide set of data elements, which are related to
the functionality of various kernel subsystems, it would be out of context to
discuss the purpose and scope of all the elements in this chapter. We shall
consider a few important elements that are related to process management.
Process attributes - key elements
Process attributes define all the key and fundamental characteristics of a process.
These elements contain the process's state and identifications along with other
key values of importance.
state
A process, right from the time it is spawned until it exits, may exist in various
states, referred to as process states. They define the process's current state:
- TASK_RUNNING (0): The task is either executing or contending for CPU
  in the scheduler run-queue.
- TASK_INTERRUPTIBLE (1): The task is in an interruptible wait state; it
  remains in wait until an awaited condition becomes true, such as the
  availability of mutual exclusion locks, device ready for I/O, lapse of sleep
  time, or an exclusive wake-up call. While in this wait state, any signals
  generated for the process are delivered, causing it to wake up before the
  wait condition is met.
- TASK_KILLABLE: This is similar to TASK_INTERRUPTIBLE, with
  the exception that interruptions can only occur on fatal signals, which
  makes it a better alternative to TASK_UNINTERRUPTIBLE.
- TASK_UNINTERRUPTIBLE (2): The task is in an uninterruptible wait state
  similar to TASK_INTERRUPTIBLE, except that generated signals to the
  sleeping process do not cause wake-up. When the event occurs for which it
  is waiting, the process transitions to TASK_RUNNING. This state is used
  sparingly, because a task in it cannot be interrupted by signals.
- TASK_STOPPED (4): The task has received a STOP signal. It will be
  back to running on receiving the continue signal (SIGCONT).
- TASK_TRACED (8): A process is said to be in traced state when it is
  being traced, typically by a debugger.
- EXIT_ZOMBIE (32): The process is terminated, but its resources are not
  yet reclaimed.
- EXIT_DEAD (16): The child is terminated and all the resources held by it
  are freed, after the parent collects the exit status of the child using wait.
The following figure depicts process states:
pid
This field contains a unique process identifier referred to as PID. PIDs in Linux
are of the type pid_t (integer). Though a PID is an integer, the default maximum
number of PIDs is 32,768, specified through the /proc/sys/kernel/pid_max interface.
The value in this file can be set to any value up to 2^22 (PID_MAX_LIMIT,
approximately 4 million).
To manage PIDs, the kernel uses a bitmap. This bitmap allows the kernel to keep
track of PIDs in use and assign a unique PID for new processes. Each PID is
identified by a bit in the PID bitmap; the value of a PID is determined from the
position of its corresponding bit. Bits with value 1 in the bitmap indicate that the
corresponding PIDs are in use, and those with value 0 indicate free PIDs.
Whenever the kernel needs to assign a unique PID, it looks for the first unset bit
and sets it to 1, and conversely, to free a PID, it clears the corresponding bit
from 1 to 0.
tgid
This field contains the thread group ID. For easy understanding, let's say when a
new process is created, its PID and TGID are the same, as the process happens to
be the only thread. When the process spawns a new thread, the new child gets a
unique PID but inherits the TGID from the parent, as it belongs to the same
thread group. The TGID is primarily used to support multi-threaded processes. We
will delve into further details in the threads section of this chapter.
thread info
This field holds processor-specific state information, and is a critical element of
the task structure. Later sections of this chapter contain details about the
importance of thread_info.
flags
The per-process flags include the following (numeric values omitted):
#define PF_EXITING /* getting shut down */
#define PF_EXITPIDONE /* pi exit done on shut down */
#define PF_VCPU /* I'm a virtual CPU */
#define PF_WQ_WORKER /* I'm a workqueue worker */
#define PF_FORKNOEXEC /* forked but didn't exec */
#define PF_MCE_PROCESS /* process policy on mce errors */
#define PF_SUPERPRIV /* used super-user privileges */
#define PF_DUMPCORE /* dumped core */
#define PF_SIGNALED /* killed by a signal */
#define PF_MEMALLOC /* Allocating memory */
#define PF_NPROC_EXCEEDED /* set_user noticed that RLIMIT_NPROC was exceeded */
#define PF_USED_MATH /* if unset the fpu must be initialized before use */
#define PF_USED_ASYNC /* used async_schedule*(), used by module init */
#define PF_NOFREEZE /* this thread should not be frozen */
#define PF_FROZEN /* frozen for system suspend */
#define PF_FSTRANS /* inside a filesystem transaction */
#define PF_KSWAPD /* I am kswapd */
#define PF_MEMALLOC_NOIO /* Allocating memory without IO involved */
#define PF_LESS_THROTTLE /* Throttle me less: I clean memory */
#define PF_KTHREAD /* I am a kernel thread */
#define PF_RANDOMIZE /* randomize virtual address space */
#define PF_SWAPWRITE /* Allowed to write to swap */
#define PF_NO_SETAFFINITY /* Userland is not allowed to meddle with cpus_allowed */
#define PF_MCE_EARLY /* Early kill for mce process policy */
#define PF_MUTEX_TESTER /* Thread belongs to the rt mutex tester */
#define PF_FREEZER_SKIP /* Freezer should not count it as freezable */
#define PF_SUSPEND_TASK /* this thread called freeze_processes and should not be frozen */
exit_code and exit_signal
These fields contain the exit value of the task and details of the signal that
caused the termination. These fields are to be accessed by the parent process
through wait() on termination of the child.
comm
This field holds the name of the binary executable used to start the process.
ptrace
This field is enabled and set when the process is put into trace mode using the
ptrace() system call.
Process relations - key elements
Every process can be related to a parent process, establishing a parent-child
relationship. Similarly, multiple processes spawned by the same process are
called siblings. These fields establish how the current process relates to another
process.
real_parent and parent
These are pointers to the parent's task structure. For a normal process, both these
pointers refer to the same task_struct; they differ mainly when the process is
being traced. In such cases, real_parent refers to the task structure of the
process that created it, and parent refers to the task structure of the process
to which SIGCHLD is delivered, such as a debugger attached through ptrace().
children
This is a pointer to a list of child task structures.
sibling
This is a pointer to a list of sibling task structures.
group_leader
This is a pointer to the task structure of the thread group leader.
Scheduling attributes - key elements
All contending processes must be given fair CPU time, and this calls for
scheduling based on time slices and process priorities. These attributes contain
necessary information that the scheduler uses when deciding on which process
gets priority when contending.
prio and static_prio
prio helps determine the priority of the process for scheduling. This field holds
the static priority of the process within the range 1 to 99 (as specified by
sched_setscheduler()) if the process is assigned a real-time scheduling policy. For
normal processes, this field holds a dynamic priority derived from the nice value.
se, rt, and dl
Every task belongs to a scheduling entity (group of tasks), as scheduling is done
at a per-entity level. se is for all normal processes, rt is for real-time processes,
and dl is for deadline processes. We will discuss more on these attributes in the
next chapter on scheduling.
policy
This field contains information about the scheduling policy of the process, which
helps in determining its priority.
cpus_allowed
This field specifies the CPU mask for the process, that is, on which CPU(s) the
process is eligible to be scheduled in a multi-processor system.
rt_priority
This field specifies the priority to be applied by real-time scheduling policies.
For non-real-time processes, this field is unused.
Process limits - key elements
/* include/uapi/linux/resource.h */
struct rlimit {
    __kernel_ulong_t rlim_cur;
    __kernel_ulong_t rlim_max;
};
These limits are specified in include/uapi/asm-generic/resource.h:
#define RLIMIT_CPU 0 /* CPU time in sec */
#define RLIMIT_FSIZE 1 /* Maximum file size */
#define RLIMIT_DATA 2 /* max data size */
#define RLIMIT_STACK 3 /* max stack size */
#define RLIMIT_CORE 4 /* max core file size */
#ifndef RLIMIT_RSS
#define RLIMIT_RSS 5 /* max resident set size */
#endif
#ifndef RLIMIT_NPROC
#define RLIMIT_NPROC 6 /* max number of processes */
#endif
#ifndef RLIMIT_NOFILE
#define RLIMIT_NOFILE 7 /* max number of open files */
#endif
#ifndef RLIMIT_MEMLOCK
#define RLIMIT_MEMLOCK 8 /* max locked-in-memory address space */
#endif
#ifndef RLIMIT_AS
#define RLIMIT_AS 9 /* address space limit */
#endif
#define RLIMIT_LOCKS 10 /* maximum file locks held */
#define RLIMIT_SIGPENDING 11 /* max number of pending signals */
#define RLIMIT_MSGQUEUE 12 /* maximum bytes in POSIX mqueues */
#define RLIMIT_NICE 13 /* max nice prio allowed to raise to 0-39 for nice level 19 .. -20 */
#define RLIMIT_RTPRIO 14 /* maximum realtime priority */
#define RLIMIT_RTTIME 15 /* timeout for RT tasks in us */
#define RLIM_NLIMITS 16
File descriptor table - key elements
During the lifetime of a process, it may access various resource files to get its
task done. This results in the process opening, closing, reading, and writing to
these files. The system must keep track of these activities; file descriptor
elements help the system know which files the process holds.
fs
Filesystem information is stored in this field.
files
The file descriptor table contains pointers to all the files that a process opens to
perform various operations. The files field contains a pointer, which points to
this file descriptor table.
Signal descriptor - key elements
For processes to handle signals, the task structure has various elements that
determine how the signals must be handled.
signal
This is of type struct signal_struct, which contains information on all the signals
associated with the process.
sighand
This is of type struct sighand_struct, which contains all signal handlers associated
with the process.
sigset_t blocked, real_blocked
These elements identify signals that are currently masked or blocked by the
process.
pending
This is of type struct sigpending, which identifies signals which are generated but
not yet delivered.
sas_ss_sp
Thisfieldcontainsapointertoanalternatestack,whichfacilitatessignal
handling.
sas_ss_size
Thisfiledshowsthesizeofthealternatestack,usedforsignalhandling.
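To make the alternate stack fields more concrete, here is a minimal user-space sketch, assuming a POSIX system. It registers an alternate stack with sigaltstack() (the registration that the sas_ss_sp and sas_ss_size fields describe) and checks that a handler installed with SA_ONSTACK really runs on it. The names handler_ran_on_alt_stack and ALT_STACK_SIZE, and the 64 KB size, are assumptions made for this example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define ALT_STACK_SIZE (64 * 1024)          /* assumed size for the example */

static char *alt_stack_base;                /* start of the alternate stack */
static volatile sig_atomic_t ran_on_alt;    /* set by the handler */

static void on_sigusr1(int signo)
{
    char probe;                             /* lives on whichever stack is in use */
    uintptr_t p = (uintptr_t)&probe;
    uintptr_t lo = (uintptr_t)alt_stack_base;

    (void)signo;
    ran_on_alt = (p >= lo && p < lo + ALT_STACK_SIZE);
}

/* Returns 1 if the handler ran on the alternate stack, 0 if not, -1 on error. */
int handler_ran_on_alt_stack(void)
{
    stack_t ss;
    struct sigaction sa;
    int result;

    alt_stack_base = malloc(ALT_STACK_SIZE);
    if (!alt_stack_base)
        return -1;

    ss.ss_sp = alt_stack_base;              /* base address of the alternate stack */
    ss.ss_size = ALT_STACK_SIZE;            /* its size in bytes */
    ss.ss_flags = 0;
    if (sigaltstack(&ss, NULL) != 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigusr1;
    sa.sa_flags = SA_ONSTACK;               /* deliver on the alternate stack */
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGUSR1, &sa, NULL) != 0)
        return -1;

    raise(SIGUSR1);                         /* handler runs before raise() returns */
    result = ran_on_alt;

    ss.ss_flags = SS_DISABLE;               /* unregister before freeing the memory */
    sigaltstack(&ss, NULL);
    free(alt_stack_base);
    return result;
}
```

An alternate stack is typically used so that a handler can still run when the normal stack is exhausted, for example while handling SIGSEGV caused by a stack overflow.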
Kernel stack
With current-generation computing platforms powered by multi-core hardware
capable of running applications simultaneously, it is entirely possible for multiple
processes to switch into kernel mode concurrently while requesting the same
kernel service. To handle such situations, kernel services are designed to be
re-entrant, allowing multiple processes to step in and engage the required
services. This requires each requesting process to maintain its own private
kernel stack to keep track of the kernel function call sequence, store the local
data of the kernel functions, and so on.
The kernel stack is directly mapped to physical memory, which requires it to
occupy a physically contiguous region. By default, the kernel stack is 8 KB on
x86-32 and most other 32-bit systems (with an option to configure a 4 KB kernel
stack during kernel build), and 16 KB on an x86-64 system.
When kernel services are invoked in the current process context, they need to
validate the process's privileges before committing to any relevant operations.
To perform such validations, the kernel services must gain access to the task
structure of the current process and look through the relevant fields. Similarly,
kernel routines might need access to the current task structure to inspect or
modify various resource structures, such as the signal handler tables, pending
signals, the file descriptor table, and the memory descriptor, among others. To
enable access to the task structure at runtime, the address of the current task
structure is loaded into a processor register (the register chosen is architecture
specific) and made available through a kernel global macro called current
(defined in the architecture-specific kernel header asm/current.h):
/* arch/ia64/include/asm/current.h */
#ifndef _ASM_IA64_CURRENT_H
#define _ASM_IA64_CURRENT_H
/*
 * Modified 1998-2000
 * David Mosberger-Tang <davidm@hpl.hp.com>, Hewlett-Packard Co
 */
#include <asm/intrinsics.h>
/*
 * In kernel mode, thread pointer (r13) is used to point to the
 * current task structure.
 */
#define current ((struct task_struct *) ia64_getreg(_IA64_REG_TP))
#endif /* _ASM_IA64_CURRENT_H */
/* arch/powerpc/include/asm/current.h */
#ifndef _ASM_POWERPC_CURRENT_H
#define _ASM_POWERPC_CURRENT_H
#ifdef __KERNEL__
/*
 * This program is free software; you can redistribute it and/or
 * modify it under the terms of the GNU General Public License
 * as published by the Free Software Foundation; either version
 * 2 of the License, or (at your option) any later version.
 */
struct task_struct;
#ifdef __powerpc64__
#include <linux/stddef.h>
#include <asm/paca.h>
static inline struct task_struct *get_current(void)
{
    struct task_struct *task;
    __asm__ __volatile__("ld %0,%1(13)"
        : "=r" (task)
        : "i" (offsetof(struct paca_struct, __current)));
    return task;
}
#define current get_current()
#else
/*
 * We keep `current' in r2 for speed.
 */
register struct task_struct *current asm ("r2");
#endif
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_CURRENT_H */
However, on register-constrained architectures, where there are few registers to
spare, reserving a register to hold the address of the current task structure is not
viable. On such platforms, the task structure of the current process is placed
directly at the base (lowest address) of the memory area that holds its kernel
stack. Because that area is aligned to its size, this approach offers a significant
advantage in locating the task structure: it is found by simply masking off
the least significant bits of the stack pointer.
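The masking trick can be sketched in plain user-space C. This is only an arithmetic illustration under stated assumptions: the helper stack_area_base and the constant DEMO_THREAD_SIZE are hypothetical names, and the 8 KB size is taken from the default mentioned in the text. It works only when the stack area is a power of two in size and aligned to that size.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stack area size for illustration: 8 KB (must be a power of two). */
#define DEMO_THREAD_SIZE 8192UL

/* Round a stack pointer value down to the start of its size-aligned stack area. */
uintptr_t stack_area_base(uintptr_t sp, uintptr_t thread_size)
{
    /* The mask is only valid for power-of-two sizes. */
    assert(thread_size != 0 && (thread_size & (thread_size - 1)) == 0);
    return sp & ~(thread_size - 1);
}
```

For example, with an 8 KB stack area, any stack pointer value between 0x12344000 and 0x12345fff maps to the same base, 0x12344000, which is where the per-process structure would be found.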
With the evolution of the kernel, the task structure grew and became too large to
be contained in the kernel stack, which is already restricted in physical memory
(8 KB). As a result, the task structure was moved out of the kernel stack, barring a
few key fields that define the process's CPU state and other low-level,
processor-specific information. These fields were then wrapped in a newly created
structure called struct thread_info. This structure resides at the base of the kernel
stack area and provides a pointer to the current task structure, which can be
used by kernel services.
The following code snippet shows struct thread_info for the x86 architecture (kernel
3.10):
/* linux-3.10/arch/x86/include/asm/thread_info.h */
struct thread_info {
    struct task_struct *task; /* main task structure */
    struct exec_domain *exec_domain; /* execution domain */
    __u32 flags; /* low level flags */
    __u32 status; /* thread synchronous flags */
    __u32 cpu; /* current CPU */
    int preempt_count; /* 0 => preemptable, <0 => BUG */
    mm_segment_t addr_limit;
    struct restart_block restart_block;
    void __user *sysenter_return;
#ifdef CONFIG_X86_32
    unsigned long previous_esp; /* ESP of the previous stack in case of
                                   nested (IRQ) stacks */
    __u8 supervisor_stack[0];
#endif
    unsigned int sig_on_uaccess_error:1;
    unsigned int uaccess_err:1; /* uaccess failed */
};
With thread_info containing process-related information in addition to the task
structure, the kernel has two views of the current process: struct
task_struct, an architecture-independent information block, and thread_info, an
architecture-specific one. The following figure depicts thread_info and
task_struct:
For architectures that use thread_info, the current macro's implementation is
modified to look at the base of the kernel stack area to obtain a reference to the
current thread_info and, through it, the current task structure. The following code
snippet shows such an implementation: the generic definition of current, followed
by a current_thread_info() routine that masks the stack pointer:
#ifndef __ASM_GENERIC_CURRENT_H
#define __ASM_GENERIC_CURRENT_H
#include <linux/thread_info.h>
#define get_current() (current_thread_info()->task)
#define current get_current()
#endif /* __ASM_GENERIC_CURRENT_H */
/*
 * how to get the current stack pointer in C
 */
register unsigned long current_stack_pointer asm ("sp");
/*
 * how to get the thread information struct from C
 */
static inline struct thread_info *current_thread_info(void)
    __attribute_const__;
static inline struct thread_info *current_thread_info(void)
{
    return (struct thread_info *)
        (current_stack_pointer & ~(THREAD_SIZE - 1));
}
As the use of PER_CPU variables has increased in recent times, the process scheduler
is tuned to cache crucial current process-related information in the PER_CPU area.
This change enables quicker access to current process data than looking it up
through the kernel stack. The following code snippet shows the implementation of
the current macro that fetches the current task data through a PER_CPU variable:
#ifndef _ASM_X86_CURRENT_H
#define _ASM_X86_CURRENT_H
#include <linux/compiler.h>
#include <asm/percpu.h>
#ifndef __ASSEMBLY__
struct task_struct;
DECLARE_PER_CPU(struct task_struct *, current_task);
static __always_inline struct task_struct *get_current(void)
{
    return this_cpu_read_stable(current_task);
}
#define current get_current()
#endif /* __ASSEMBLY__ */
#endif /* _ASM_X86_CURRENT_H */
The use of PER_CPU data led to a gradual reduction of information in thread_info.
With thread_info shrinking in size, kernel developers are considering getting rid of
thread_info altogether by moving it into the task structure. As this involves
changes to low-level architecture code, it has only been implemented for the
x86-64 architecture, with other architectures planned to follow. The following
code snippet shows the current state of the thread_info structure with just one
element:
/* linux-4.9.10/arch/x86/include/asm/thread_info.h */
struct thread_info {
    unsigned long flags; /* low level flags */
};
The issue of stack overflow
Unlike user mode, the kernel mode stack lives in directly mapped memory.
When a process invokes a kernel service, which may internally be deeply nested,
chances are that it may overrun its stack into the adjacent memory range. The worst
part is that the kernel will be oblivious to such occurrences. Kernel programmers
usually engage various debug options to track stack usage and detect overruns,
but these methods are of little help in preventing stack breaches on production
systems. Conventional protection through the use of guard pages is also ruled
out here (as it wastes an actual memory page).
Kernel programmers tend to follow coding standards (minimizing the use of
local data, avoiding recursion, and avoiding deep nesting, among others) to cut
down the probability of a stack breach. However, the implementation of feature-rich
and deeply layered kernel subsystems may pose various design challenges and
complications, especially with the storage subsystem, where filesystems, storage
drivers, and networking code can be stacked up in several layers, resulting in
deeply nested function calls.
The Linux kernel community has been pondering how to prevent such breaches
for quite a long time, and toward that end, the decision was made to expand the
kernel stack to 16 KB (x86-64, since kernel 3.15). Expansion of the kernel stack might
prevent some breaches, but at the cost of committing much of the directly mapped
kernel memory to per-process kernel stacks. However, for reliable
functioning of the system, the kernel is expected to handle stack
breaches gracefully when they show up on production systems.
With the 4.9 release, the kernel gained a new mechanism to set up virtually
mapped kernel stacks. Since virtual addresses are already used to map even a
directly mapped page, in principle the kernel stack does not actually require
physically contiguous pages. The kernel reserves a separate range of addresses
for virtually mapped memory, and addresses from this range are allocated when
a call to vmalloc() is made. This range of memory is referred to as the vmalloc
range. This range is primarily used when programs require huge chunks of
memory that are virtually contiguous but physically scattered. Using this, the
kernel stack can now be allotted as individual pages, mapped into the vmalloc
range. Virtual mapping also enables protection from overruns, as a no-access
guard page can be set up with just a page table entry (without wasting an actual
page). A guard page prompts the kernel to emit an oops message on a stack
overrun and to kill the overrunning process.
Virtually mapped kernel stacks with guard pages are currently available only for
the x86-64 architecture (with support for other architectures expected to follow).
On architectures that declare HAVE_ARCH_VMAP_STACK, the feature is enabled with the
CONFIG_VMAP_STACK build-time option.
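The idea of a no-access guard page can be illustrated with a user-space analogue. The sketch below is not how the kernel sets up its stacks; it only assumes a POSIX system with mmap() and mprotect(), and the function name guard_page_fault_signal is hypothetical. It maps a small region, turns the lowest page into a guard, and lets a child process overrun into it so the parent can observe the resulting fatal signal.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns the signal that killed a child which wrote into the guard page,
 * 0 if the child was not killed by a signal, or -1 on error. */
int guard_page_fault_signal(void)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t len = (size_t)page * 3;          /* 1 guard page + 2 usable pages */
    char *region;
    pid_t pid;
    int status;

    region = mmap(NULL, len, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;

    /* The lowest page becomes the guard: any access to it faults. */
    if (mprotect(region, (size_t)page, PROT_NONE) != 0) {
        munmap(region, len);
        return -1;
    }

    region[page] = 1;                       /* usable area: this write is fine */

    pid = fork();
    if (pid < 0) {
        munmap(region, len);
        return -1;
    }
    if (pid == 0) {
        volatile char *guard = region;

        guard[0] = 1;                       /* "overrun" into the guard page */
        _exit(0);                           /* not expected to be reached */
    }

    if (waitpid(pid, &status, 0) != pid) {
        munmap(region, len);
        return -1;
    }
    munmap(region, len);
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

On Linux the child is expected to be terminated by SIGSEGV, which mirrors the way a guard page turns a silent overrun into an immediately visible fault.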
Process creation
During kernel boot, a kernel thread called init is spawned, which in turn is
configured to initialize the first user-mode process (with the same name). The
init (pid 1) process is then configured to carry out various initialization
operations specified through configuration files, creating multiple processes.
Every child process created thereafter (which may in turn create its own child
processes) is a descendant of the init process. Processes thus created end up
in a tree-like structure, or a single hierarchy model. The shell, which is one such
process, becomes the interface through which users create user processes when
programs are called for execution.
fork, vfork, exec, clone, wait, and exit are the core kernel interfaces for the
creation and control of new processes. These operations are invoked through
corresponding user-mode APIs.
fork()
fork() is one of the core Unix process-creation APIs, available across *nix systems
since the earliest Unix releases. Aptly named, it forks a new process from
a running process. When fork() succeeds, the new process (referred to
as the child) is created by duplicating the caller's address space and task structure.
On return from fork(), both the caller (parent) and the new process (child) resume
executing instructions from the same code segment, which was duplicated under
copy-on-write. fork() is perhaps the only API that enters kernel mode in the context
of the caller process and, on success, returns to user mode in the context of both
the caller and the child (new process).
Most resource entries of the parent's task structure, such as the memory descriptor,
file descriptor table, signal descriptors, and scheduling attributes, are inherited by
the child, except for a few attributes such as memory locks, pending signals,
active timers, and file record locks (for the full list of exceptions, refer to the
fork(2) man page). A child process is assigned a unique PID and records a reference
to its parent in its task structure (visible to user space as the parent PID, or
PPID); the child's resource utilization and processor usage entries are reset to zero.
The parent process learns about the child's state using the wait() system
call and normally waits for the termination of the child process. If the parent
does not call wait(), a terminated child remains in the zombie state.
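The parent/child relationship described above can be sketched with the standard fork() and waitpid() calls. This is a minimal illustration rather than a complete program; the function name run_child_and_reap is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child that exits with exit_code, reap it, and return the exit status
 * seen by the parent. Returns -1 on error. */
int run_child_and_reap(int exit_code)
{
    pid_t parent = getpid();
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;                      /* fork failed, no child exists */

    if (pid == 0) {
        /* Child: fork() returned 0. It has its own PID, and its parent PID
         * is the PID of the process that called fork(). */
        _exit(getppid() == parent ? exit_code : 255);
    }

    /* Parent: fork() returned the child's PID. Waiting reaps the child so
     * that it does not linger as a zombie. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Both branches run the same code after fork(); only the return value of fork() tells the two processes apart.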
Copy-on-write (COW)
Duplicating the parent process to create a child requires cloning the user-mode
address space (stack, data, code, and heap segments) and task structure of the parent
for the child; this would result in execution overhead that leads to
nondeterministic process-creation time. To make matters worse, this cloning
would be wasted effort if neither the parent nor the child initiated any
state-change operations on the cloned resources.
With COW, when a child is created, it is allocated a unique task structure with
all resource entries (including page tables) referring to those of the parent,
with read-only access for both parent and child. Resources are truly duplicated
only when either of the processes initiates a state-change operation, hence the name
copy-on-write (write in COW implies a state change). COW brings
efficiency and optimization to the fore by deferring the duplication of
process data until a write occurs, and in cases where only reads happen, it avoids
duplication altogether. This on-demand copying also reduces the number of swap pages
needed, cuts down the time spent on swapping, and might help reduce demand
paging.
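Copy-on-write itself is invisible to programs, but its observable effect is easy to check: a write made by the child after fork() never shows up in the parent. The following sketch demonstrates only that semantic guarantee (it cannot show when the kernel actually copies a page); the names are hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int value = 42;      /* initially backed by a page shared between parent and child */

/* Returns the value the parent sees after the child has modified its own copy,
 * or -1 on error. */
int parent_value_after_child_write(void)
{
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;

    if (pid == 0) {
        value = 99;                     /* the write gives the child a private copy */
        _exit(value == 99 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
        return -1;

    return value;                       /* the parent's copy was never modified */
}
```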
exec
At times, creating a child process might not be useful unless it runs a new
program altogether: the exec family of calls serves precisely this purpose. exec
replaces the existing program in a process with a new executable binary:
#include <unistd.h>
int execve(const char *filename, char *const argv[],
           char *const envp[]);
execve is the system call that executes the program binary file passed as the
first argument to it. The second and third arguments are null-terminated arrays of
argument and environment strings, which are passed to the new program as its
command-line arguments and environment. This system call can also be invoked
through various glibc (library) wrappers, which are found to be more convenient
and flexible:
#include <unistd.h>
extern char **environ;
int execl(const char *path, const char *arg, ...);
int execlp(const char *file, const char *arg, ...);
int execle(const char *path, const char *arg,
           ..., char *const envp[]);
int execv(const char *path, char *const argv[]);
int execvp(const char *file, char *const argv[]);
int execvpe(const char *file, char *const argv[],
            char *const envp[]);
Command-line user-interface programs such as the shell use the exec interface to
launch user-requested program binaries.
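The way a shell combines fork() with exec can be sketched as follows. This is a simplified illustration that assumes /bin/sh exists on the system; the helper name run_shell_command is hypothetical.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run a command line through /bin/sh in a child process and return its exit
 * status, or -1 on error. Assumes /bin/sh is present. */
int run_shell_command(const char *command)
{
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Replace the child's program image; on success execl() never returns. */
        execl("/bin/sh", "sh", "-c", command, (char *)NULL);
        _exit(127);                     /* reached only if the exec failed */
    }

    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The PID of the child stays the same across the exec call; only the program running inside the process changes.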
vfork()
Unlike fork(), vfork() creates a child process and blocks the parent, which means
that the child runs alone and does not execute concurrently with the parent; in other
words, the parent process is temporarily suspended until the child exits or calls
exec(). Until then, the child shares the parent's address space, including its data.
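A minimal sketch of vfork() usage follows, assuming a system that still provides the call. Because the child borrows the parent's address space, it does nothing except call _exit(); the function name vfork_child_status is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create a child with vfork() that exits immediately with exit_code.
 * Returns the status seen by the parent, or -1 on error. */
int vfork_child_status(int exit_code)
{
    int status;
    pid_t pid = vfork();

    if (pid < 0)
        return -1;

    if (pid == 0)
        _exit(exit_code);       /* the child should only call _exit() or exec*() */

    /* The parent resumes here only after the child has exited (or exec'ed). */
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```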
Linux support for threads
The flow of execution in a process is referred to as a thread, which implies that
every process has at least one thread of execution. Multi-threaded means
the existence of multiple flows of execution in a process. With modern
many-core architectures, multiple flows of execution in a process can be truly
concurrent, achieving fair multitasking.
Threads are normally enumerated as pure user-level entities within a process that
are scheduled for execution; they share the parent's virtual address space and system
resources. Each thread maintains its own code, stack, and thread-local storage.
Threads are scheduled and managed by the thread library, which uses a structure
referred to as a thread object to hold a unique thread identifier and scheduling
attributes, and to save the thread context. User-level thread applications are
generally lighter on memory, and are the preferred model of concurrency for
event-driven applications. On the flip side, such a user-level thread model is not
suitable for parallel computing, since the threads are tied to the same processor
core to which their parent process is bound.
Linux doesn't support user-level threads directly; it instead offers an alternate
API to enumerate a special process, called a lightweight process (LWP), that can
share a set of configured resources such as dynamic memory allocations, global
data, open files, signal handlers, and other extensive resources with the parent
process. Each LWP is identified by a unique PID and task structure, and is
treated by the kernel as an independent execution context. In Linux, the term
thread invariably refers to an LWP, since each thread initialized by the thread
library (Pthreads) is enumerated as an LWP by the kernel.
clone()
clone() is a Linux-specific system call to create a new process; it is considered a
generic version of the fork() system call, offering finer controls to customize its
functionality through the flags argument:
int clone(int (*child_func)(void *), void *child_stack, int flags, void *arg);
It provides more than twenty different CLONE_* flags that control various aspects of
the clone operation, including whether the parent and child process share
resources such as virtual memory, open file descriptors, and signal dispositions.
The child is created with the memory address passed as the second
argument to be used as its stack (for storing the child's local data). The child
process starts its execution with its start function (passed as the first argument to
the clone call).
When a process attempts to create a thread through the pthread library, clone() is
invoked with the following flags:
/* clone flags for creating threads */
flags = CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SIGHAND | CLONE_THREAD | CLONE_SYSVSEM | CLONE_SETTLS | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID;
clone() can also be used to create a regular child process of the kind normally
spawned using fork() and vfork():
/* clone flags for forking child */
flags = SIGCHLD;
/* clone flags for vfork child */
flags = CLONE_VFORK | CLONE_VM | SIGCHLD;
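The effect of a single sharing flag can be seen with a small Linux-only sketch that calls the glibc clone() wrapper with CLONE_VM | SIGCHLD. The child increments a counter in the parent's memory, which would not be visible to the parent after a plain fork(). The names child_fn, counter_after_clone, and CHILD_STACK_SIZE are hypothetical, and the 64 KB stack size is an arbitrary choice for the example.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)    /* assumed stack size for the child */

static int counter;                     /* shared with the child through CLONE_VM */

/* Start function of the child: runs on the stack handed to clone(). */
static int child_fn(void *arg)
{
    int *shared = arg;

    *shared += 1;                       /* modifies the parent's memory directly */
    return 0;                           /* becomes the child's exit status */
}

/* Returns the counter value seen by the parent after the child has run,
 * or -1 on error. */
int counter_after_clone(void)
{
    char *stack = malloc(CHILD_STACK_SIZE);
    pid_t pid;

    if (!stack)
        return -1;

    counter = 0;
    /* The stack grows downward on common architectures, so pass its top. */
    pid = clone(child_fn, stack + CHILD_STACK_SIZE, CLONE_VM | SIGCHLD, &counter);
    if (pid < 0) {
        free(stack);
        return -1;
    }

    if (waitpid(pid, NULL, 0) != pid) {
        free(stack);
        return -1;
    }
    free(stack);
    return counter;
}
```

Dropping CLONE_VM from the flags would give the child a private copy-on-write address space, and the parent would then still see the counter at zero.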
Kernel threads
To meet the need for running background operations, the kernel spawns
threads (similar to processes). These kernel threads are similar to regular
processes, in that they are represented by a task structure and assigned a PID.
Unlike user processes, they do not have any user address space mapped, and run
exclusively in kernel mode, which makes them non-interactive. Various kernel
subsystems use kthreads to run periodic and asynchronous operations.
All kernel threads are descendants of kthreadd (pid 2), which is spawned by the
kernel (pid 0) during boot. kthreadd enumerates other kernel threads; it
provides interface routines through which other kernel threads can be
dynamically spawned at runtime by kernel services. Kernel threads can be
viewed from the command line with the ps -ef command, where they are shown in
[square brackets]:
UID    PID  PPID  C  STIME  TTY  TIME      CMD
root     1     0  0  22:43  ?    00:00:01  /sbin/init splash
root     2     0  0  22:43  ?    00:00:00  [kthreadd]
root     3     2  0  22:43  ?    00:00:00  [ksoftirqd/0]
root     4     2  0  22:43  ?    00:00:00  [kworker/0:0]
root     5     2  0  22:43  ?    00:00:00  [kworker/0:0H]
root     7     2  0  22:43  ?    00:00:01  [rcu_sched]
root     8     2  0  22:43  ?    00:00:00  [rcu_bh]
root     9     2  0  22:43  ?    00:00:00  [migration/0]
root    10     2  0  22:43  ?    00:00:00  [watchdog/0]
root    11     2  0  22:43  ?    00:00:00  [watchdog/1]
root    12     2  0  22:43  ?    00:00:00  [migration/1]
root    13     2  0  22:43  ?    00:00:00  [ksoftirqd/1]
root    15     2  0  22:43  ?    00:00:00  [kworker/1:0H]
root    16     2  0  22:43  ?    00:00:00  [watchdog/2]
root    17     2  0  22:43  ?    00:00:00  [migration/2]
root    18     2  0  22:43  ?    00:00:00  [ksoftirqd/2]
root    20     2  0  22:43  ?    00:00:00  [kworker/2:0H]
root    21     2  0  22:43  ?    00:00:00  [watchdog/3]
root    22     2  0  22:43  ?    00:00:00  [migration/3]
root    23     2  0  22:43  ?    00:00:00  [ksoftirqd/3]
root    25     2  0  22:43  ?    00:00:00  [kworker/3:0H]
root    26     2  0  22:43  ?    00:00:00  [kdevtmpfs]
/* kthreadd creation code (init/main.c) */
static noinline void __ref rest_init(void)
{
    int pid;
    rcu_scheduler_starting();
    /*
     * We need to spawn init first so that it obtains pid 1, however
     * the init task will end up wanting to create kthreads, which, if
     * we schedule it before we create kthreadd, will OOPS.
     */
    kernel_thread(kernel_init, NULL, CLONE_FS);
    numa_default_policy();
    pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
    rcu_read_lock();
    kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
    rcu_read_unlock();
    complete(&kthreadd_done);
    /*
     * The boot idle thread must execute schedule()
     * at least once to get things moving:
     */
    init_idle_bootup_task(current);
    schedule_preempt_disabled();
    /* Call into cpu_idle with preempt disabled */
    cpu_startup_entry(CPUHP_ONLINE);
}
The previous code shows the kernel boot routine rest_init() invoking the
kernel_thread() routine with appropriate arguments to spawn both the kernel_init
thread (which then goes on to start the user-mode init process) and kthreadd.
kthreadd is a perpetually running thread that looks into a list called
kthread_create_list for data on new kthreads to be created:
/* kthreadd routine (kthread.c) */
int kthreadd(void *unused)
{
    struct task_struct *tsk = current;
    /* Setup a clean context for our children to inherit. */
    set_task_comm(tsk, "kthreadd");
    ignore_signals(tsk);
    set_cpus_allowed_ptr(tsk, cpu_all_mask);
    set_mems_allowed(node_states[N_MEMORY]);
    current->flags |= PF_NOFREEZE;
    for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (list_empty(&kthread_create_list))
            schedule();
        __set_current_state(TASK_RUNNING);
        spin_lock(&kthread_create_lock);
        while (!list_empty(&kthread_create_list)) {
            struct kthread_create_info *create;
            create = list_entry(kthread_create_list.next,
                                struct kthread_create_info, list);
            list_del_init(&create->list);
            spin_unlock(&kthread_create_lock);
            create_kthread(create); /* creates kernel threads with attributes enqueued */
            spin_lock(&kthread_create_lock);
        }
        spin_unlock(&kthread_create_lock);
    }
    return 0;
}
Kernel threads are created by invoking either kthread_create or its wrapper
kthread_run, passing appropriate arguments that define the kthread (start routine,
argument data for the start routine, and name). The following code snippet shows
kthread_create invoking kthread_create_on_node(), which by default creates threads on
the current NUMA node:
struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
                                           void *data,
                                           int node,
                                           const char namefmt[], ...);
/**
 * kthread_create - create a kthread on the current node
 * @threadfn: the function to run in the thread
 * @data: data pointer for @threadfn()
 * @namefmt: printf-style format string for the thread name
 * @...: arguments for @namefmt.
 *
 * This macro will create a kthread on the current node, leaving it in
 * the stopped state. This is just a helper for
 * kthread_create_on_node();
 * see the documentation there for more details.
 */
#define kthread_create(threadfn, data, namefmt, arg...) \
    kthread_create_on_node(threadfn, data, NUMA_NO_NODE, namefmt, ##arg)
struct task_struct *kthread_create_on_cpu(int (*threadfn)(void *data),
                                          void *data,
                                          unsigned int cpu,
                                          const char *namefmt);
/**
 * kthread_run - create and wake a thread.
 * @threadfn: the function to run until signal_pending(current).
 * @data: data ptr for @threadfn.
 * @namefmt: printf-style name for the thread.
 *
 * Description: Convenient wrapper for kthread_create() followed by
 * wake_up_process(). Returns the kthread or ERR_PTR(-ENOMEM).
 */
#define kthread_run(threadfn, data, namefmt, ...) \
({ \
    struct task_struct *__k \
        = kthread_create(threadfn, data, namefmt, ## __VA_ARGS__); \
    if (!IS_ERR(__k)) \
        wake_up_process(__k); \
    __k; \
})
kthread_create_on_node() stores the details (received as arguments) of the kthread
to be created in a structure of type kthread_create_info and queues it at the tail of
kthread_create_list. It then wakes up kthreadd and waits for thread creation to
complete:
/* kernel/kthread.c */
static struct task_struct *__kthread_create_on_node(int (*threadfn)(void *data),
                                                    void *data, int node,
                                                    const char namefmt[],
                                                    va_list args)
{
    DECLARE_COMPLETION_ONSTACK(done);
    struct task_struct *task;
    struct kthread_create_info *create = kmalloc(sizeof(*create),
                                                 GFP_KERNEL);
    if (!create)
        return ERR_PTR(-ENOMEM);
    create->threadfn = threadfn;
    create->data = data;
    create->node = node;
    create->done = &done;
    spin_lock(&kthread_create_lock);
    list_add_tail(&create->list, &kthread_create_list);
    spin_unlock(&kthread_create_lock);
    wake_up_process(kthreadd_task);
    /*
     * Wait for completion in killable state, for I might be chosen by
     * the OOM killer while kthreadd is trying to allocate memory for
     * new kernel thread.
     */
    if (unlikely(wait_for_completion_killable(&done))) {
        /*
         * If I was SIGKILLed before kthreadd (or new kernel thread)
         * calls complete(), leave the cleanup of this structure to
         * that thread.
         */
        if (xchg(&create->done, NULL))
            return ERR_PTR(-EINTR);
        /*
         * kthreadd (or new kernel thread) will call complete()
         * shortly.
         */
        wait_for_completion(&done); // wake up on completion of thread creation.
    }
    ...
    ...
    ...
}
struct task_struct *kthread_create_on_node(int (*threadfn)(void *data),
                                           void *data, int node,
                                           const char namefmt[],
                                           ...)
{
    struct task_struct *task;
    va_list args;
    va_start(args, namefmt);
    task = __kthread_create_on_node(threadfn, data, node, namefmt, args);
    va_end(args);
    return task;
}
Recall that kthreadd invokes the create_kthread() routine to start kernel threads as per
the data queued into the list. This routine creates the thread and signals completion:
/* kernel/kthread.c */
static void create_kthread(struct kthread_create_info *create)
{
    int pid;
#ifdef CONFIG_NUMA
    current->pref_node_fork = create->node;
#endif
    /* We want our own signal handler (we take no signals by default). */
    pid = kernel_thread(kthread, create, CLONE_FS | CLONE_FILES |
                        SIGCHLD);
    if (pid < 0) {
        /* If user was SIGKILLed, I release the structure. */
        struct completion *done = xchg(&create->done, NULL);
        if (!done) {
            kfree(create);
            return;
        }
        create->result = ERR_PTR(pid);
        complete(done); /* signal completion of thread creation */
    }
}
/* kernel/fork.c */
/*
 * Create a kernel thread.
 */
pid_t kernel_thread(int (*fn)(void *), void *arg, unsigned long flags)
{
    return _do_fork(flags | CLONE_VM | CLONE_UNTRACED, (unsigned long)fn,
                    (unsigned long)arg, NULL, NULL, 0);
}

/* sys_fork: create a child process by duplicating caller */
SYSCALL_DEFINE0(fork)
{
#ifdef CONFIG_MMU
    return _do_fork(SIGCHLD, 0, 0, NULL, NULL, 0);
#else
    /* can not support in nommu mode */
    return -EINVAL;
#endif
}

/* sys_vfork: create vfork child process */
SYSCALL_DEFINE0(vfork)
{
    return _do_fork(CLONE_VFORK | CLONE_VM | SIGCHLD, 0,
                    0, NULL, NULL, 0);
}

/* sys_clone: create child process as per clone flags */

#ifdef __ARCH_WANT_SYS_CLONE
#ifdef CONFIG_CLONE_BACKWARDS
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
                int __user *, parent_tidptr,
                unsigned long, tls,
                int __user *, child_tidptr)
#elif defined(CONFIG_CLONE_BACKWARDS2)
SYSCALL_DEFINE5(clone, unsigned long, newsp, unsigned long, clone_flags,
                int __user *, parent_tidptr,
                int __user *, child_tidptr,
                unsigned long, tls)
#elif defined(CONFIG_CLONE_BACKWARDS3)
SYSCALL_DEFINE6(clone, unsigned long, clone_flags, unsigned long, newsp,
                int, stack_size,
                int __user *, parent_tidptr,
                int __user *, child_tidptr,
                unsigned long, tls)
#else
SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
                int __user *, parent_tidptr,
                int __user *, child_tidptr,
                unsigned long, tls)
#endif
{
    return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
}
#endif
Process status and termination
During its lifetime, a process passes through many states before it
ultimately terminates. Users need proper mechanisms to stay informed about
everything that happens to a process during its lifetime. Linux provides a set of
functions for this purpose.
wait
For processes and threads created by a parent, it might be functionally useful for
the parent to know the execution status of the child process/thread. This can be
achieved using the wait family of system calls:
#include <sys/types.h>
#include <sys/wait.h>
pid_t wait(int *status);
pid_t waitpid(pid_t pid, int *status, int options);
int waitid(idtype_t idtype, id_t id, siginfo_t *infop, int options);
These system calls update the calling process with the state-change events of a
child. The following state-change events are reported:
Termination of child
Stopped by a signal
Resumed by a signal
In addition to reporting the status, these APIs allow the parent process to reap a
terminated child. A process on termination is put into the zombie state until the
immediate parent engages the wait call to reap it.
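The status value filled in by the wait calls is decoded with the standard W* macros. The following sketch, assuming a POSIX system, reports the termination-by-signal case; the function name child_termination_signal is hypothetical, and SIGKILL is used only because it cannot be blocked or ignored.

```c
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Fork a child, terminate it with SIGKILL, and return the signal number that
 * waitpid() reports. Returns 0 if the child exited normally, -1 on error. */
int child_termination_signal(void)
{
    pid_t pid = fork();
    int status;

    if (pid < 0)
        return -1;

    if (pid == 0) {
        for (;;)
            pause();                    /* sleep until a signal arrives */
    }

    if (kill(pid, SIGKILL) != 0)
        return -1;

    /* Blocks until the child changes state; this also reaps the child. */
    if (waitpid(pid, &status, 0) != pid)
        return -1;

    if (WIFSIGNALED(status))            /* terminated by a signal */
        return WTERMSIG(status);
    if (WIFEXITED(status))              /* terminated through exit() */
        return 0;
    return -1;
}
```

Stop and resume events are reported only when waitpid() is called with the WUNTRACED and WCONTINUED options, and are decoded with WIFSTOPPED() and WIFCONTINUED().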
exit
Every process must end. A process terminates either by
calling exit() or when its main function returns. A process may also be
terminated abruptly on receiving a signal or exception that forces it to terminate,
for example when the kill command sends it a signal, or when an
exception is raised. Upon termination, the process is put into the exit state until the
immediate parent reaps it.
exit() invokes the sys_exit system call, which internally calls the do_exit routine.
do_exit primarily performs the following tasks (do_exit sets many values and
makes multiple calls to related kernel routines to complete its task):
Takes the exit code that the child returns to the parent.
Sets the PF_EXITING flag, indicating that the process is exiting.
Cleans up and reclaims the resources held by the process. This includes
releasing mm_struct, removal from the queue if it is waiting for an IPC
semaphore, release of filesystem data and files, if any, and calling schedule(),
as the process is no longer executable.
After do_exit, the process remains in the zombie state and the process descriptor is
still intact for the parent to collect the status, after which the resources are
reclaimed by the system.
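That a terminated child stays around until its parent collects the status can be observed from user space. The sketch below assumes a system that supports waitid() with WNOWAIT: it first peeks at the exited child without reaping it, then reaps it, and finally confirms that nothing is left to wait for. The function name times_child_was_observable is hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns how many times the exited child could be observed by wait calls,
 * or -1 on error. */
int times_child_was_observable(void)
{
    pid_t pid = fork();
    siginfo_t info;
    int status;
    int seen = 0;

    if (pid < 0)
        return -1;
    if (pid == 0)
        _exit(4);                       /* child terminates immediately */

    /* Peek: WNOWAIT reports the exit but leaves the zombie in place. */
    info.si_pid = 0;
    if (waitid(P_PID, (id_t)pid, &info, WEXITED | WNOWAIT) == 0 &&
        info.si_pid == pid)
        seen++;

    /* Reap: the status is collected and the process entry is released. */
    if (waitpid(pid, &status, 0) == pid)
        seen++;

    /* Nothing left: a further wait on the same PID fails. */
    if (waitpid(pid, &status, 0) == pid)
        seen++;

    return seen;
}
```

The first two calls are expected to succeed and the third to fail, since the zombie exists only between termination and the reaping wait.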
Namespaces and cgroups
Users logged into a Linux system have a transparent view of various system
entities such as global resources, processes, kernel, and users. For instance, a
valid user can access the PIDs of all running processes on the system (irrespective
of the user to which they belong). Users can observe the presence of other users on
the system, and they can run commands to view the state of global system
resources such as memory, filesystem mounts, and devices. Such operations are
not deemed intrusions or considered security breaches, as it is always
guaranteed that one user/process can never intrude into another user/process.
However, such transparency is unwarranted on a few server platforms. For
instance, consider cloud service providers offering PaaS (platform as a
service). They offer an environment to host and deploy custom client
applications. They manage runtime, storage, operating system, middleware, and
networking services, leaving customers to manage their applications and data.
PaaS services are used by various e-commerce, financial, online gaming, and
other related enterprises.
For efficient and effective isolation and resource management for clients, PaaS
service providers use various tools. They virtualize the system environment for
each client to achieve security, reliability, and robustness. The Linux kernel
provides low-level mechanisms in the form of cgroups and namespaces for
building various lightweight tools that can virtualize the system environment.
Docker is one such framework that builds on cgroups and namespaces.
Namespaces fundamentally are mechanisms to abstract, isolate, and limit the
visibility that a group of processes has over various system entities such as
process trees, network interfaces, user IDs, and filesystem mounts. Namespaces
are categorized into several groups, which we will now see.
Mount namespaces
Traditionally, mount and unmount operations change the filesystem view as
seen by all processes in the system; in other words, there is one global mount
namespace seen by all processes. Mount namespaces confine the set of
filesystem mount points visible within a process namespace, enabling a
process group in one mount namespace to have a view of the filesystem
list that is distinct from that of processes in another.
UTS namespaces
These enable isolating the system's host and domain name within a UTS
namespace. This allows initialization and configuration scripts to be guided
by the respective namespaces.
IPC namespaces
These isolate System V IPC objects and POSIX message queues.
This prevents a process in one IPC namespace from accessing the resources of
another.
PID namespaces
Traditionally, *nix kernels (including Linux) spawn the init process with PID 1
during system boot, which in turn starts other user-mode processes and is
considered the root of the process tree (all the other processes start below this
process in the tree). A PID namespace allows a process to spin off a new tree
of processes under it with its own root process (PID 1 process). PID namespaces
isolate process ID numbers and allow duplication of PID numbers across
different PID namespaces, which means that processes in different PID
namespaces can have the same process ID. The process IDs within a PID
namespace are unique, and are assigned sequentially starting with PID 1.
PID namespaces are used in containers (a lightweight virtualization solution) to
migrate a container, along with its process tree, onto a different host system
without any changes to PIDs.
Network namespaces
This type of namespace provides abstraction and virtualization of network
protocol services and interfaces. Each network namespace has its own
network device instances that can be configured with individual network
addresses. Isolation also applies to other network resources: routing tables, port
numbers, and so on.
User namespaces
User namespaces allow a process to use different user and group IDs inside and
outside a namespace. This means that a process can use privileged user and
group IDs (zero) within a user namespace and continue with non-zero user and
group IDs outside the namespace.
Cgroup namespaces
A cgroup namespace virtualizes the contents of the /proc/self/cgroup file.
Processes inside a cgroup namespace are only able to view paths relative to their
namespace root.
Control groups (cgroups)
Cgroups are kernel mechanisms to restrict and measure resource allocations to
each process group. Using cgroups, you can allocate resources such as CPU
time, network, and memory.
Similar to the process model in Linux, where each process is a child of a parent
and ultimately descends from the init process, thus forming a single tree-like
structure, cgroups are hierarchical, and child cgroups inherit the attributes of
the parent. What makes them different is that multiple cgroup hierarchies can
exist within a single system, each having distinct resource prerogatives.
Applying cgroups on namespaces results in the isolation of processes into containers
within a system, where resources are managed distinctly. Each container behaves
like a lightweight virtual machine; containers run as individual entities and are
oblivious of other entities within the same system.
The following are the namespace APIs described in the Linux man page for
namespaces:
clone(2)
The clone(2) system call creates a new process. If the flags argument of the call specifies one or more of the CLONE_NEW* flags listed below, then new namespaces are created for each flag, and the child process is made a member of those namespaces. (This system call also implements a number of features unrelated to namespaces.)
setns(2)
The setns(2) system call allows the calling process to join an existing namespace. The namespace to join is specified via a file descriptor that refers to one of the /proc/[pid]/ns files described below.
unshare(2)
The unshare(2) system call moves the calling process to a new namespace. If the flags argument of the call specifies one or more of the CLONE_NEW* flags listed below, then new namespaces are created for each flag, and the calling process is made a member of those namespaces. (This system call also implements a number of features unrelated to namespaces.)
Namespace   Constant          Isolates
Cgroup      CLONE_NEWCGROUP   Cgroup root directory
IPC         CLONE_NEWIPC      System V IPC, POSIX message queues
Network     CLONE_NEWNET      Network devices, stacks, ports, etc.
Mount       CLONE_NEWNS       Mount points
PID         CLONE_NEWPID      Process IDs
User        CLONE_NEWUSER     User and group IDs
UTS         CLONE_NEWUTS      Hostname and NIS domain name
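To make the table concrete, the following is a minimal user-space sketch of how the CLONE_NEW* constants might be combined and handed to unshare(2). The helper names ns_flag_for() and enter_new_namespaces() are hypothetical and exist only for this illustration; creating most namespace types normally requires privilege (CAP_SYS_ADMIN), so the unshare() call is expected to fail with EPERM for an ordinary user.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stddef.h>
#include <string.h>

/* Older C libraries may not expose this flag; the value is the kernel's. */
#ifndef CLONE_NEWCGROUP
#define CLONE_NEWCGROUP 0x02000000
#endif

/* Hypothetical lookup table mirroring the namespace/constant table. */
struct ns_flag {
    const char *name;
    int flag;
};

static const struct ns_flag ns_flags[] = {
    { "cgroup", CLONE_NEWCGROUP },
    { "ipc",    CLONE_NEWIPC },
    { "net",    CLONE_NEWNET },
    { "mnt",    CLONE_NEWNS },
    { "pid",    CLONE_NEWPID },
    { "user",   CLONE_NEWUSER },
    { "uts",    CLONE_NEWUTS },
};

/* Returns the CLONE_NEW* flag for a namespace name, or 0 if unknown. */
int ns_flag_for(const char *name)
{
    size_t i;

    for (i = 0; i < sizeof(ns_flags) / sizeof(ns_flags[0]); i++)
        if (strcmp(ns_flags[i].name, name) == 0)
            return ns_flags[i].flag;
    return 0;
}

/*
 * Moves the calling process into freshly created namespaces of the
 * requested types. Returns 0 on success, -1 with errno set on failure.
 */
int enter_new_namespaces(const char *const *names, size_t n)
{
    int flags = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        int flag = ns_flag_for(names[i]);

        if (!flag) {
            errno = EINVAL;
            return -1;
        }
        flags |= flag;
    }
    return unshare(flags);
}
```

A caller could pass, for example, the names "uts" and "ipc" to obtain a private hostname and private System V IPC objects in one call.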
Summary
We have now understood one of the principal abstractions of Linux, the process, and the whole ecosystem that facilitates this abstraction. The challenge that remains is running scores of processes while providing each with fair CPU time. With many-core systems imposing a multitude of processes with diverse policies and priorities, the need for deterministic scheduling is paramount.
In our next chapter, we will delve into process scheduling, another critical aspect of process management, and comprehend how the Linux scheduler is designed to handle this diversity.
Deciphering the Process Scheduler
Process scheduling is one of the most crucial executive jobs of any operating system, Linux being no different. The heuristics and efficiency in scheduling processes are what make any operating system tick and also give it an identity, such as a general-purpose operating system, a server, or a real-time system. In this chapter, we will get under the skin of the Linux scheduler, deciphering concepts such as:
Linux scheduler design
Scheduling classes
Scheduling policies and priorities
Completely Fair Scheduler
Real-Time Scheduler
Deadline Scheduler
Group scheduling
Preemption
Process schedulers
The effectiveness of any operating system is proportional to its ability to fairly schedule all contending processes. The process scheduler is the core component of the kernel, which computes and decides when and for how long a process gets CPU time. Ideally, processes require a timeslice of the CPU to run, so schedulers essentially need to allocate slices of processor time fairly among processes.
A scheduler typically has to:
Avoid process starvation
Manage priority scheduling
Maximize throughput of all processes
Ensure low turnaround time
Ensure even resource usage
Avoid CPU hogging
Consider a process's behavioral patterns for prioritization
Degrade gracefully under heavy load
Handle scheduling on multiple cores efficiently
Linux process scheduler design
Linux, which was primarily developed for desktop systems, has unassumingly evolved into a multi-dimensional operating system, with its usage spread across embedded devices, mainframes, supercomputers, and room-sized servers. It has also seamlessly accommodated ever-evolving, diverse computing platforms such as SMP, virtualization, and real-time systems. The diversity of these platforms is brought forth by the kind of processes that run on these systems. For instance, a highly interactive desktop system may run processes that are I/O bound, and a real-time system thrives on deterministic processes. Every kind of process thus calls for a different kind of heuristic when it needs to be fairly scheduled, as a CPU-intensive process may require more CPU time than a normal process, and a real-time process would require deterministic execution. Linux, which caters to a wide spectrum of systems, is thus confronted with addressing the varying scheduling challenges that come along when managing these diverse processes.
The intrinsic design of Linux's process scheduler handles this challenge by adopting a simple two-layered model. The first layer, the Generic Scheduler, defines abstract operations that serve as entry functions for the scheduler. The second layer, the scheduling class, implements the actual scheduling operations, where each class is dedicated to handling the scheduling heuristics of a particular kind of process. This model enables the generic scheduler to remain abstracted from the implementation details of every scheduler class. For instance, normal processes (I/O bound) can be handled by one class, and processes that require deterministic execution, such as real-time processes, can be handled by another class. This architecture also enables adding a new scheduling class seamlessly. The previous figure depicts the layered design of the process scheduler.
The generic scheduler defines abstract interfaces through a structure called sched_class:
struct sched_class {
    const struct sched_class *next;

    void (*enqueue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*dequeue_task)(struct rq *rq, struct task_struct *p, int flags);
    void (*yield_task)(struct rq *rq);
    bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);

    void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);

    /*
     * It is the responsibility of the pick_next_task() method that will
     * return the next task to call put_prev_task() on the @prev task or
     * something equivalent.
     *
     * May return RETRY_TASK when it finds a higher prio class has runnable
     * tasks.
     */
    struct task_struct *(*pick_next_task)(struct rq *rq,
                                          struct task_struct *prev,
                                          struct rq_flags *rf);
    void (*put_prev_task)(struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
    int (*select_task_rq)(struct task_struct *p, int task_cpu, int sd_flag, int flags);
    void (*migrate_task_rq)(struct task_struct *p);

    void (*task_woken)(struct rq *this_rq, struct task_struct *task);

    void (*set_cpus_allowed)(struct task_struct *p,
                             const struct cpumask *newmask);

    void (*rq_online)(struct rq *rq);
    void (*rq_offline)(struct rq *rq);
#endif

    void (*set_curr_task)(struct rq *rq);
    void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
    void (*task_fork)(struct task_struct *p);
    void (*task_dead)(struct task_struct *p);

    /*
     * The switched_from() call is allowed to drop rq->lock, therefore we
     * cannot assume the switched_from/switched_to pair is serialized by
     * rq->lock. They are however serialized by p->pi_lock.
     */
    void (*switched_from)(struct rq *this_rq, struct task_struct *task);
    void (*switched_to)(struct rq *this_rq, struct task_struct *task);
    void (*prio_changed)(struct rq *this_rq, struct task_struct *task,
                         int oldprio);

    unsigned int (*get_rr_interval)(struct rq *rq,
                                    struct task_struct *task);

    void (*update_curr)(struct rq *rq);

#define TASK_SET_GROUP  0
#define TASK_MOVE_GROUP 1

#ifdef CONFIG_FAIR_GROUP_SCHED
    void (*task_change_group)(struct task_struct *p, int type);
#endif
};
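Every scheduler class implements operations as defined in the sched_class structure. As of the 4.12.x kernel, there are three main scheduling classes: the Completely Fair Scheduling (CFS) class, the Real-Time Scheduling class, and the Deadline Scheduling class, with each class handling processes with specific scheduling requirements (the kernel additionally defines internal stop and idle classes; the idle class appears as idle_sched_class in the following code). The following code snippets show how each class populates its operations as per the sched_class structure.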
CFS class:
const struct sched_class fair_sched_class = {
    .next               = &idle_sched_class,
    .enqueue_task       = enqueue_task_fair,
    .dequeue_task       = dequeue_task_fair,
    .yield_task         = yield_task_fair,
    .yield_to_task      = yield_to_task_fair,
    .check_preempt_curr = check_preempt_wakeup,
    .pick_next_task     = pick_next_task_fair,
    .put_prev_task      = put_prev_task_fair,
    ....
};
Real-Time Scheduling class:
const struct sched_class rt_sched_class = {
    .next               = &fair_sched_class,
    .enqueue_task       = enqueue_task_rt,
    .dequeue_task       = dequeue_task_rt,
    .yield_task         = yield_task_rt,
    .check_preempt_curr = check_preempt_curr_rt,
    .pick_next_task     = pick_next_task_rt,
    .put_prev_task      = put_prev_task_rt,
    ....
};
Deadline Scheduling class:
const struct sched_class dl_sched_class = {
    .next               = &rt_sched_class,
    .enqueue_task       = enqueue_task_dl,
    .dequeue_task       = dequeue_task_dl,
    .yield_task         = yield_task_dl,
    .check_preempt_curr = check_preempt_curr_dl,
    .pick_next_task     = pick_next_task_dl,
    .put_prev_task      = put_prev_task_dl,
    ....
};
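The way these instances chain together through their next fields can be modeled in a few lines of user-space C. The following is only a toy sketch: every toy_* type and function is hypothetical, each class holds at most one runnable task for brevity, and none of this is kernel code. It shows how a generic layer can walk the class list from the highest-priority class down to the idle class without knowing how any class selects its task.

```c
#include <stddef.h>

/* Toy stand-ins for task_struct and rq; all names are illustrative. */
struct toy_task {
    const char *name;
};

struct toy_rq {
    struct toy_task *dl_task;    /* runnable deadline task, or NULL */
    struct toy_task *rt_task;    /* runnable real-time task, or NULL */
    struct toy_task *fair_task;  /* runnable CFS task, or NULL */
    struct toy_task idle;        /* the idle task is always available */
};

/* Reduced version of the sched_class interface: one operation only. */
struct toy_class {
    const struct toy_class *next;
    struct toy_task *(*pick_next_task)(struct toy_rq *rq);
};

static struct toy_task *pick_dl(struct toy_rq *rq)   { return rq->dl_task; }
static struct toy_task *pick_rt(struct toy_rq *rq)   { return rq->rt_task; }
static struct toy_task *pick_fair(struct toy_rq *rq) { return rq->fair_task; }
static struct toy_task *pick_idle(struct toy_rq *rq) { return &rq->idle; }

/* Linked in decreasing priority: deadline -> rt -> fair -> idle. */
static const struct toy_class toy_idle_class = { NULL, pick_idle };
static const struct toy_class toy_fair_class = { &toy_idle_class, pick_fair };
static const struct toy_class toy_rt_class   = { &toy_fair_class, pick_rt };
static const struct toy_class toy_dl_class   = { &toy_rt_class, pick_dl };

/* Generic layer: ask each class in turn until one returns a task. */
struct toy_task *toy_pick_next_task(struct toy_rq *rq)
{
    const struct toy_class *cls;

    for (cls = &toy_dl_class; cls; cls = cls->next) {
        struct toy_task *p = cls->pick_next_task(rq);

        if (p)
            return p;
    }
    return NULL; /* not reached: the idle class always returns a task */
}
```

Adding a new class in this model amounts to defining one more toy_class instance and linking it into the chain, which is the extensibility property the two-layered design aims for.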
Runqueue
Conventionally, the runqueue contains all the processes that are contending for CPU time on a given CPU core (a runqueue is per-CPU). The generic scheduler is designed to look into the runqueue whenever it is invoked to schedule the next best runnable task. Maintaining a common runqueue layout for all the runnable processes would not be possible, since each scheduling class deals with specific scheduling policies and priorities.
The kernel addresses this by bringing its design principles to the fore. Each scheduling class defines the layout of its runqueue data structure as best suits its policies. The generic scheduler layer implements an abstract runqueue structure with common elements that serves as the runqueue interface. This structure is extended with members that refer to class-specific runqueues. In other words, all scheduling classes embed their runqueues into the main runqueue structure. This is a classic design hack, which lets every scheduler class choose an appropriate layout for its runqueue data structure.
The following code snippet of struct rq (runqueue) will help us comprehend the concept (elements related to SMP have been omitted from the structure to keep our focus on what's relevant):
struct rq {
    /* runqueue lock: */
    raw_spinlock_t lock;

    /*
     * nr_running and cpu_load should be in the same cacheline because
     * remote CPUs use both these fields when doing load calculation.
     */
    unsigned int nr_running;
#ifdef CONFIG_NUMA_BALANCING
    unsigned int nr_numa_running;
    unsigned int nr_preferred_running;
#endif
#define CPU_LOAD_IDX_MAX 5
    unsigned long cpu_load[CPU_LOAD_IDX_MAX];
#ifdef CONFIG_NO_HZ_COMMON
#ifdef CONFIG_SMP
    unsigned long last_load_update_tick;
#endif /* CONFIG_SMP */
    unsigned long nohz_flags;
#endif /* CONFIG_NO_HZ_COMMON */
#ifdef CONFIG_NO_HZ_FULL
    unsigned long last_sched_tick;
#endif
    /* capture load from *all* tasks on this cpu: */
    struct load_weight load;
    unsigned long nr_load_updates;
    u64 nr_switches;

    struct cfs_rq cfs;
    struct rt_rq rt;
    struct dl_rq dl;

#ifdef CONFIG_FAIR_GROUP_SCHED
    /* list of leaf cfs_rq on this cpu: */
    struct list_head leaf_cfs_rq_list;
    struct list_head *tmp_alone_branch;
#endif /* CONFIG_FAIR_GROUP_SCHED */

    unsigned long nr_uninterruptible;

    struct task_struct *curr, *idle, *stop;
    unsigned long next_balance;
    struct mm_struct *prev_mm;

    unsigned int clock_skip_update;
    u64 clock;
    u64 clock_task;

    atomic_t nr_iowait;

#ifdef CONFIG_IRQ_TIME_ACCOUNTING
    u64 prev_irq_time;
#endif
#ifdef CONFIG_PARAVIRT
    u64 prev_steal_time;
#endif
#ifdef CONFIG_PARAVIRT_TIME_ACCOUNTING
    u64 prev_steal_time_rq;
#endif

    /* calc_load related fields */
    unsigned long calc_load_update;
    long calc_load_active;

#ifdef CONFIG_SCHED_HRTICK
#ifdef CONFIG_SMP
    int hrtick_csd_pending;
    struct call_single_data hrtick_csd;
#endif
    struct hrtimer hrtick_timer;
#endif
    ...
#ifdef CONFIG_CPU_IDLE
    /* Must be inspected within a rcu lock section */
    struct cpuidle_state *idle_state;
#endif
};
You can see how the scheduling classes (cfs, rt, and dl) embed themselves into the runqueue. Other elements of interest in the runqueue are:
nr_running: This denotes the number of processes in the runqueue
load: This denotes the current load on the queue (all runnable processes)
curr and idle: These point to the task_struct of the current running task and the idle task, respectively. The idle task is scheduled when there are no other tasks to run.
The scheduler's entry point
The process of scheduling starts with a call to the generic scheduler, that is, the schedule() function, defined in <kernel/sched/core.c>. This is perhaps one of the most invoked routines in the kernel. The job of schedule() is to pick the next best runnable task. The pick_next_task() routine called by schedule() iterates through the corresponding pick_next_task operations of the scheduler classes and ends up picking the next best task to run. The scheduler classes are linked in a singly linked list (through the next field of sched_class), which enables pick_next_task() to iterate through them.
Considering that Linux was primarily designed to cater to highly interactive systems, the function first takes a shortcut: if there are no runnable tasks in any of the higher-priority classes, it looks for the next best runnable task in the CFS class directly (this is detected by checking whether the total number of runnable tasks (nr_running) in the runqueue is equal to the number of runnable tasks in the CFS class's sub-runqueue); otherwise, it iterates through all the classes and picks the next best runnable task. Finally, if no tasks are found, it falls back to the idle class, which always returns a non-null task.
The following code block shows the implementation of pick_next_task():
/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
    const struct sched_class *class;
    struct task_struct *p;

    /*
     * Optimization: we know that if all tasks are in the fair class we can
     * call that function directly, but only if the @prev task wasn't of a
     * higher scheduling class, because otherwise those lose the
     * opportunity to pull in more work from other CPUs.
     */
    if (likely((prev->sched_class == &idle_sched_class ||
                prev->sched_class == &fair_sched_class) &&
               rq->nr_running == rq->cfs.h_nr_running)) {

        p = fair_sched_class.pick_next_task(rq, prev, rf);
        if (unlikely(p == RETRY_TASK))
            goto again;

        /* Assumes fair_sched_class->next == idle_sched_class */
        if (unlikely(!p))
            p = idle_sched_class.pick_next_task(rq, prev, rf);

        return p;
    }

again:
    for_each_class(class) {
        p = class->pick_next_task(rq, prev, rf);
        if (p) {
            if (unlikely(p == RETRY_TASK))
                goto again;
            return p;
        }
    }

    /* The idle class should always have a runnable task: */
    BUG();
}
Process priorities
The decision of which process to run depends on the priority of the process. Every process is labelled with a priority value, giving it an immediate position in terms of when it will be given CPU time. Priorities are fundamentally classified into dynamic and static priorities on *nix systems. Dynamic priorities are applied to normal processes dynamically by the kernel, considering various factors such as the nice value of the process, its historic behavior (I/O bound or processor bound), elapsed execution time, and waiting time. Static priorities are applied to real-time processes by the user, and the kernel does not change their priorities dynamically. Processes with static priorities are thus given higher priority when scheduling.
I/O bound process: When the execution of a process is heavily punctuated with I/O operations (waiting for a resource or an event), for instance a text editor, which mostly alternates between running and waiting for a key press, such processes are called I/O bound. Due to this nature, the scheduler normally allocates short processor time slices to I/O-bound processes and multiplexes them with other processes, adding the overhead of context switching and the subsequent heuristics of computing the next best process to run.
Processor bound process: These are processes that love to hold on to CPU time slices, as they require maximum utilization of the processor's computing capacity. Processes requiring heavy computations, such as complex scientific calculations and video rendering codecs, are processor bound. Though a longer CPU slice looks desirable for them, running them within fixed time periods is not often a requirement. Schedulers on interactive operating systems tend to favor I/O-bound processes over processor-bound ones. Linux, which aims for good interactive performance, is more optimized for faster response time, inclining towards I/O-bound processes; processor-bound processes are run less frequently, but are ideally given longer timeslices to run.
Processes can also be multi-faceted, with an I/O-bound process needing to perform serious scientific computations, burning the CPU.
The nice value of any normal process ranges between 19 (lowest priority) and -20 (highest priority), with 0 being the default value. A higher nice value indicates a lower priority (the process is being nicer to other processes). Real-time processes are prioritized between 1 and 99 (static priority). All these priority ranges are from the perspective of the user.
Kernel's perspective of priorities
Linux, however, looks at process priorities from its own perspective. It adds a lot more computation for arriving at the priority of a process. Basically, it scales all priorities into the range 0 to 139, where 0 to 99 is assigned for real-time processes and 100 to 139 represents the nice value range (-20 to 19).
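The scaling described above can be written down as a small sketch. The TOY_* constants and function names below are hypothetical, chosen for this illustration; the arithmetic follows the ranges given in the text (nice -20 to 19 mapped onto 100 to 139), and the inverse mapping for real-time priorities reflects the assumption that, inside the kernel, a numerically lower value means a higher priority.

```c
/* Illustrative constants; the ranges come from the text above. */
#define TOY_MAX_RT_PRIO   100                 /* kernel priorities 0..99 are real-time */
#define TOY_NICE_WIDTH    40                  /* nice values -20..19 */
#define TOY_DEFAULT_PRIO  (TOY_MAX_RT_PRIO + TOY_NICE_WIDTH / 2)   /* 120 */

/* Maps a nice value (-20..19) to the kernel's 100..139 range. */
int nice_to_kernel_prio(int nice)
{
    return TOY_DEFAULT_PRIO + nice;
}

/* Inverse of nice_to_kernel_prio(). */
int kernel_prio_to_nice(int prio)
{
    return prio - TOY_DEFAULT_PRIO;
}

/*
 * Maps a user-visible real-time priority (1..99, higher is more urgent)
 * to the kernel scale, where a lower number is more urgent.
 */
int rt_to_kernel_prio(int rt_priority)
{
    return TOY_MAX_RT_PRIO - 1 - rt_priority;
}
```

With this mapping, a default process (nice 0) sits at kernel priority 120, and any real-time task always ends up with a numerically lower, and therefore more urgent, value than every normal task.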
Scheduler classes
Let's now go deeper into each scheduling class and understand the operations, policies, and heuristics it engages in managing scheduling operations for its processes. As mentioned earlier, an instance of struct sched_class must be provided by each scheduling class; let's look at some of the key elements from that structure:
enqueue_task: Called to add a process to the runqueue
dequeue_task: Called when the process is taken off the runqueue
yield_task: Called when the process wants to relinquish the CPU voluntarily
pick_next_task: The class-specific counterpart of the pick_next_task() called by schedule(). It picks up the next best runnable task from its class.
Completely Fair Scheduling class (CFS)
All processes with dynamic priorities are handled by the CFS class, and as most processes in general-purpose *nix systems are normal (non-realtime), CFS remains the busiest scheduler class in the kernel.
CFS relies on maintaining balance in allocating processor time to tasks, based on policies and dynamic priorities assigned per task. Process scheduling under CFS is implemented under the premise that it has an "ideal, precise multi-tasking CPU," that equally powers all processes at its peak capacity. For instance, if there are two processes, the perfectly multi-tasking CPU ensures that both processes run simultaneously, each utilizing 50% of its power. As this is practically impossible on a single CPU (it would require true parallelism), CFS allocates processor time to a process by maintaining proper balance across all contending processes. If a process fails to receive a fair amount of time, it is considered out of balance, and thus goes in next as the best runnable process.
CFS does not rely on traditional time slices for allocating processor time, but rather uses the concept of virtual runtime (vruntime): it denotes the amount of CPU time a process has received, which means a low vruntime value indicates that the process is processor deprived and a high vruntime value denotes that the process has acquired considerable processor time. Processes with low vruntime values get maximum priority when scheduling. CFS also engages sleeper fairness for processes that are idly waiting for an I/O request. Sleeper fairness demands that waiting processes be given considerable CPU time when they eventually wake up, post event. Based on the vruntime value, CFS decides what amount of time the process is to run. It also uses the nice value to weigh a process in relation to all contending processes: a higher-value, low-priority process gets less weight, and a lower-value, high-priority task gets more weight. Handling processes with varying priorities is also elegant in Linux: the vruntime of a lower-priority task advances faster than that of a higher-priority task, which makes the time allocated to a low-priority task dissipate quickly.
Computing priorities and time slices under CFS
Priorities are assigned based on how long the process is waiting, how long the process ran, the process's historical behavior, and its nice value. Normally, schedulers engage complex algorithms to end up with the next best process to run.
In computing the timeslice every process gets, CFS does not rely on the nice value of the process alone but also looks at the load weight of the process. For every increase in the nice value of a process by 1, there will be roughly a 10% reduction in its share of CPU time, and for every decrease in the nice value by 1, there will be roughly a 10% addition to its share of CPU time; in other words, nice values are multiplicative, with adjacent weights differing by a factor of about 1.25. To compute the load weight for corresponding nice values, the kernel maintains an array called prio_to_weight, where each nice value corresponds to a weight:
static const int prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};
The load value of a process is stored in the weight field of struct load_weight.
Like a process's weight, the runqueue of CFS is also assigned a weight, which is the gross weight of all the tasks in the runqueue. Now the timeslice is computed by factoring the entity's load weight, the runqueue's load weight, and the sched_period (scheduling period).
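As a rough illustration of how these three factors combine, the sketch below computes a slice as period * entity weight / runqueue weight. This is a simplified user-space model, not the kernel's implementation: the toy_* and nice_to_weight() names are hypothetical, the weight table is copied from the listing above, and the scheduling period is simply passed in by the caller (the tests use 6 ms purely as an assumed example value).

```c
#include <stddef.h>
#include <stdint.h>

/* Copy of the nice-to-weight table shown above (index 0 is nice -20). */
static const int toy_prio_to_weight[40] = {
    88761, 71755, 56483, 46273, 36291,
    29154, 23254, 18705, 14949, 11916,
     9548,  7620,  6100,  4904,  3906,
     3121,  2501,  1991,  1586,  1277,
     1024,   820,   655,   526,   423,
      335,   272,   215,   172,   137,
      110,    87,    70,    56,    45,
       36,    29,    23,    18,    15,
};

/* Looks up the load weight for a nice value in the range -20..19. */
unsigned int nice_to_weight(int nice)
{
    return (unsigned int)toy_prio_to_weight[nice + 20];
}

/* Gross weight of a runqueue holding tasks with the given nice values. */
unsigned int toy_rq_weight(const int *nices, size_t n)
{
    unsigned int total = 0;
    size_t i;

    for (i = 0; i < n; i++)
        total += nice_to_weight(nices[i]);
    return total;
}

/* slice = period * (entity weight / runqueue weight), in nanoseconds. */
uint64_t toy_sched_slice(uint64_t period_ns, unsigned int se_weight,
                         unsigned int rq_weight)
{
    return period_ns * se_weight / rq_weight;
}
```

For two tasks at nice 0 and nice 1, the weights 1024 and 820 give shares of roughly 55% and 45% of the period, which is the 10% effect per nice level described above.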
CFS's runqueue
CFS sheds the need for a normal runqueue and uses a self-balancing red-black tree instead to get to the next best process to run in the shortest possible time. The RB tree holds all the contending processes, ordered by vruntime, and facilitates easy and quick insertion, deletion, and searching of processes. The process that most deserves the CPU (the one with the smallest vruntime) is placed at its leftmost node. The pick_next_task() function now just picks the leftmost node from the rb tree to schedule.
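The interplay between weight, vruntime, and the choice of the leftmost node can be sketched as follows. This is a deliberately simplified model: a linear scan over an array stands in for the red-black tree, the toy_* names are hypothetical, and the scaling of execution time by 1024 / weight is an assumption modeled on the weight of a nice-0 task.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_NICE_0_LOAD 1024u   /* weight of a nice-0 task */

/* Toy scheduling entity: only the fields needed for the illustration. */
struct toy_entity {
    const char *name;
    unsigned int weight;   /* load weight derived from the nice value */
    uint64_t vruntime;     /* weighted runtime accumulated so far */
};

/*
 * Charges delta_exec_ns of real execution time to the entity. A heavier
 * (higher-priority) entity is charged less virtual time for the same
 * amount of real time, so it drifts to the right of the tree more slowly.
 */
void toy_account(struct toy_entity *se, uint64_t delta_exec_ns)
{
    se->vruntime += delta_exec_ns * TOY_NICE_0_LOAD / se->weight;
}

/*
 * Stand-in for picking the leftmost rb-tree node: returns the entity
 * with the smallest vruntime, or NULL if the array is empty.
 */
struct toy_entity *toy_pick_leftmost(struct toy_entity *ses, size_t n)
{
    struct toy_entity *best = NULL;
    size_t i;

    for (i = 0; i < n; i++)
        if (!best || ses[i].vruntime < best->vruntime)
            best = &ses[i];
    return best;
}
```

After both entities run for the same real time, the lower-weight one has the larger vruntime, so the higher-weight one is picked next; a real red-black tree gives the same answer without scanning every entity.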
struct sched_entity {
    struct load_weight load;    /* for load-balancing */
    struct rb_node run_node;
    struct list_head group_node;
    unsigned int on_rq;

    u64 exec_start;
    u64 sum_exec_runtime;
    u64 vruntime;
    u64 prev_sum_exec_runtime;

    u64 nr_migrations;

#ifdef CONFIG_SCHEDSTATS
    struct sched_statistics statistics;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    int depth;
    struct sched_entity *parent;
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq *cfs_rq;
    /* rq "owned" by this entity/group: */
    struct cfs_rq *my_q;
#endif

    ....
};
load: Denotes the amount of load each entity bears on the total load of the queue
vruntime: Denotes the amount of time the process ran
/* task group related information */
struct task_group {
    struct cgroup_subsys_state css;

#ifdef CONFIG_FAIR_GROUP_SCHED
    /* schedulable entities of this group on each cpu */
    struct sched_entity **se;
    /* runqueue "owned" by this group on each cpu */
    struct cfs_rq **cfs_rq;
    unsigned long shares;

#ifdef CONFIG_SMP
    /*
     * load_avg can be heavily contended at clock tick time, so put
     * it in its own cacheline separated from the fields above which
     * will also be accessed at each tick.
     */
    atomic_long_t load_avg ____cacheline_aligned;
#endif
#endif

#ifdef CONFIG_RT_GROUP_SCHED
    struct sched_rt_entity **rt_se;
    struct rt_rq **rt_rq;

    struct rt_bandwidth rt_bandwidth;
#endif

    struct rcu_head rcu;
    struct list_head list;

    struct task_group *parent;
    struct list_head siblings;
    struct list_head children;

#ifdef CONFIG_SCHED_AUTOGROUP
    struct autogroup *autogroup;
#endif

    struct cfs_bandwidth cfs_bandwidth;
};
Now every task group has a scheduling entity for every CPU core, along with a CFS runqueue associated with it. When a task from one task group migrates from one CPU core (x) to another CPU core (y), the task is dequeued from the CFS runqueue of CPU x and enqueued to the CFS runqueue of CPU y.
Scheduling policies
Scheduling policies are applied to processes, and help in determining scheduling decisions. If you recall, in Chapter 1, Comprehending Processes, Address Space, and Threads, we described the int policy field under the scheduling attributes of struct task_struct. The policy field contains the value indicating which policy is to be applied to the process when scheduling. The CFS class handles all normal processes using the following two policies:
SCHED_NORMAL (0): This is used for all normal processes. All non-realtime processes can be summarized as normal processes. As Linux aims to be a highly responsive and interactive system, most of the scheduling activity and heuristics are centered on fairly scheduling normal processes. Normal processes are referred to as SCHED_OTHER as per POSIX.
SCHED_BATCH (3): Normally in servers, where processes are non-interactive, CPU-bound batch processing is employed. These CPU-intensive processes are given less priority than a SCHED_NORMAL process, and they do not preempt normal processes that are scheduled.
The CFS class also handles very low-priority background processes, which are specified by the following policy:
SCHED_IDLE (5): This policy is intended for low-priority background processes, which receive CPU time mostly when there are no other processes to run. Such processes are assigned the least priority among all processes. (They are distinct from the per-CPU idle task, which is picked by the idle class when the runqueue is empty.)
Real-time scheduling class
Linux supports soft real-time tasks, and they are scheduled by the real-time scheduling class. rt processes are assigned static priorities, which the kernel does not change dynamically. As real-time tasks aim at deterministic runs and desire control over when and how long they are to be scheduled, they are always given preference over normal tasks (SCHED_NORMAL). Unlike CFS, which uses an rb tree as its sub-runqueue, the rt scheduler, which is less complicated, uses a simple linked list per priority value (1 to 99). Linux applies two real-time policies, rr and fifo, when scheduling static priority processes; these are indicated by the policy element of struct task_struct.
SCHED_FIFO (1): This uses the first in, first out method to schedule soft real-time processes
SCHED_RR (2): This is the round-robin policy used to schedule soft real-time processes
FIFO
FIFO is a scheduling mechanism applied to processes with priorities higher than 0 (0 is assigned to normal processes). FIFO processes run without any timeslice allocation; in other words, they invariably run until they block for some event or explicitly yield to another process. A FIFO process also gets preempted when a higher-priority FIFO, RR, or deadline task becomes runnable. When the scheduler encounters more than one FIFO task with the same priority, it runs them in the order in which they are queued, starting with the process at the head of the list. A process that yields is moved to the tail of the list for its priority. If a higher-priority process preempts a FIFO process, the preempted process stays at the head of its list, and when all higher-priority tasks have blocked or completed, it is picked up to run again. When a new FIFO process becomes runnable, it is added to the tail of the list.
RR
The round-robin policy is similar to FIFO, with the only exception being that each process is allocated a timeslice to run. This is an enhancement to FIFO (as a FIFO process may run until it yields or waits). Similar to FIFO, the RR process at the head of the list is picked for execution (if no other higher-priority task is available); on completion of its timeslice, it gets preempted and is added back to the tail end of the list. RR processes with the same priority run round robin until preempted by a higher-priority task. When a higher-priority task preempts an RR task, the RR task waits at the head of the list, and on resumption runs for the remainder of its timeslice only.
struct sched_rt_entity {
    struct list_head run_list;
    unsigned long timeout;
    unsigned long watchdog_stamp;
    unsigned int time_slice;
    unsigned short on_rq;
    unsigned short on_list;

    struct sched_rt_entity *back;
#ifdef CONFIG_RT_GROUP_SCHED
    struct sched_rt_entity *parent;
    /* rq on which this entity is (to be) queued: */
    struct rt_rq *rt_rq;
    /* rq "owned" by this entity/group: */
    struct rt_rq *my_q;
#endif
};
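The per-priority list handling described for FIFO and RR can be modeled with a small user-space sketch. Everything here is hypothetical (the toy_* names, the fixed-size ring buffers standing in for linked lists, and the use of user-visible priorities where a higher number is more urgent); it only illustrates picking the head of the highest-priority non-empty list and rotating an RR task to the tail when its timeslice expires.

```c
#include <stddef.h>

#define TOY_RT_PRIOS  100   /* indices 1..99 are used */
#define TOY_QUEUE_CAP 8     /* arbitrary capacity for the illustration */

/* One FIFO list of task IDs per priority, kept as a small ring buffer. */
struct toy_rt_queue {
    int pids[TOY_QUEUE_CAP];
    size_t head;
    size_t len;
};

struct toy_rt_rq {
    struct toy_rt_queue queue[TOY_RT_PRIOS];
};

/* A task that becomes runnable is added to the tail of its priority list. */
int toy_rt_enqueue_tail(struct toy_rt_rq *rq, int prio, int pid)
{
    struct toy_rt_queue *q = &rq->queue[prio];

    if (q->len == TOY_QUEUE_CAP)
        return -1;
    q->pids[(q->head + q->len) % TOY_QUEUE_CAP] = pid;
    q->len++;
    return 0;
}

/* Picks the task at the head of the highest-priority non-empty list. */
int toy_rt_pick(const struct toy_rt_rq *rq, int *prio_out)
{
    int prio;

    for (prio = TOY_RT_PRIOS - 1; prio >= 1; prio--) {
        const struct toy_rt_queue *q = &rq->queue[prio];

        if (q->len) {
            if (prio_out)
                *prio_out = prio;
            return q->pids[q->head];
        }
    }
    return -1; /* no runnable real-time task */
}

/* RR timeslice expiry (or a yield): move the head task to the tail. */
void toy_rt_requeue_tail(struct toy_rt_rq *rq, int prio)
{
    struct toy_rt_queue *q = &rq->queue[prio];
    int pid;

    if (q->len < 2)
        return;
    pid = q->pids[q->head];
    q->head = (q->head + 1) % TOY_QUEUE_CAP;
    q->pids[(q->head + q->len - 1) % TOY_QUEUE_CAP] = pid;
}
```

A FIFO task would simply never have toy_rt_requeue_tail() called on its behalf by a timer; it leaves the head of its list only when it blocks, yields, or is overtaken by a task queued at a higher priority.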
Deadline scheduling class (sporadic task model deadline scheduling)
Deadline represents the new breed of RT processes on Linux (added since the 3.14 kernel). Unlike FIFO and RR, where processes may hog the CPU or be bound by timeslices, a deadline process, which is based on the GEDF (Global Earliest Deadline First) and CBS (Constant Bandwidth Server) algorithms, declares its runtime requirements up front. A sporadic process internally runs multiple tasks, with each task having a relative deadline within which it must complete executing and a computation time, defining the time that the CPU needs to complete process execution. To ensure that the kernel succeeds in executing deadline processes, the kernel runs an admittance test based on the deadline parameters, and on failure returns an error, EBUSY. Processes with the deadline policy get precedence over all other processes. Deadline processes use SCHED_DEADLINE (6) as their policy element.
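The idea behind an admittance test can be sketched as a bandwidth check: each task asks for runtime out of every period, and the sum of these fractions must stay below what the CPUs can offer. The following is a simplified model under stated assumptions, not the kernel's code: the toy_* names are hypothetical, the fixed-point shift of 20 bits is an illustrative choice, and the cap on usable bandwidth is supplied by the caller in parts per million (the tests assume 95%).

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_BW_SHIFT 20   /* fixed-point precision for utilization */

/* Per-task deadline parameters, all in nanoseconds. */
struct toy_dl_params {
    uint64_t runtime_ns;    /* computation time needed per period */
    uint64_t deadline_ns;   /* relative deadline */
    uint64_t period_ns;     /* activation period */
};

/*
 * Returns 1 if the task set fits within the allowed bandwidth on
 * ncpus processors, 0 otherwise. cap_ppm is the fraction of each CPU
 * that deadline tasks may use, in parts per million.
 */
int toy_dl_admit(const struct toy_dl_params *tasks, size_t n,
                 unsigned int ncpus, uint64_t cap_ppm)
{
    uint64_t limit = (uint64_t)ncpus * ((cap_ppm << TOY_BW_SHIFT) / 1000000u);
    uint64_t total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        const struct toy_dl_params *t = &tasks[i];

        /* Sanity rule: 0 < runtime <= deadline <= period. */
        if (t->runtime_ns == 0 || t->runtime_ns > t->deadline_ns ||
            t->deadline_ns > t->period_ns)
            return 0;

        /* Utilization of this task as a fixed-point fraction. */
        total += (t->runtime_ns << TOY_BW_SHIFT) / t->period_ns;
    }
    return total <= limit;
}
```

In this model, two tasks that each need 50 ms out of every 100 ms together require a full CPU, so they are rejected on one processor capped at 95% but admitted on two.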
Scheduler related system calls
Linux provides an entire family of system calls that manage various scheduler parameters, policies, and priorities and retrieve a multitude of scheduling-related information for the calling threads. It also enables threads to yield the CPU explicitly:
nice(int inc)
nice() takes an int parameter and adds it to the nice value of the calling thread. On success, it returns the new nice value of the thread. Nice values are within the range 19 (lowest priority) to -20 (highest priority), and can be adjusted only within this range.
getpriority(int which, id_t who)
This returns the nice value of the thread, group, user, or set of threads of a specified user as indicated by its parameters. It returns the highest priority (lowest nice value) held by any of the specified processes.
setpriority(int which, id_t who, int prio)
This sets the scheduling priority of the thread, group, user, or set of threads of a specified user as indicated by its parameters. It returns zero on success.
sched_setscheduler(pid_t pid, int policy, const struct sched_param *param)
This sets both the scheduling policy and parameters of a specified thread, indicated by its pid. If the pid is zero, the policy of the calling thread will be set. The param argument, which specifies the scheduling parameters, points to a structure sched_param, which holds int sched_priority. sched_priority must be zero for normal processes and a priority value in the range 1 to 99 for FIFO and RR policies (mentioned in the policy argument). It returns zero on success.
sched_getscheduler(pid_t pid)
It returns the scheduling policy of a thread (pid). If the pid is zero, the policy of the calling thread will be retrieved.
sched_setparam(pid_t pid, const struct sched_param *param)
It sets the scheduling parameters associated with the scheduling policy of the given thread (pid). If the pid is zero, the parameters of the calling thread are set. On success, it returns zero.
sched_getparam(pid_t pid, struct sched_param *param)
This retrieves the scheduling parameters of the specified thread (pid) into the structure pointed to by param. If the pid is zero, the scheduling parameters of the calling thread will be retrieved. On success, it returns zero.
sched_setattr(pid_t pid, struct sched_attr *attr, unsigned int flags)
It sets the scheduling policy and related attributes for the specified thread (pid). If the pid is zero, the policy and attributes of the calling thread are set. This is a Linux-specific call and is a superset of the functionality provided by the sched_setscheduler() and sched_setparam() calls. On success, it returns zero.
sched_getattr(pid_t pid, struct sched_attr *attr, unsigned int size, unsigned int flags)
It fetches the scheduling policy and related attributes of the specified thread (pid). If the pid is zero, the scheduling policy and related attributes of the calling thread will be retrieved. This is a Linux-specific call and is a superset of the functionality provided by the sched_getscheduler() and sched_getparam() calls. On success, it returns zero.
sched_get_priority_max(int policy)
sched_get_priority_min(int policy)
These return the maximum and minimum priority, respectively, for the specified policy. fifo, rr, deadline, normal, batch, and idle are supported values of policy.
sched_rr_get_interval(pid_t pid, struct timespec *tp)
It fetches the time quantum of the specified thread (pid) and writes it into the timespec struct specified by tp. If the pid is zero, the time quantum of the calling thread is fetched into tp. This is only applicable to processes with the rr policy. On success, it returns zero.
sched_yield(void)
This is called to relinquish the CPU explicitly. The thread is then added back to the tail of the queue for its priority. On success, it returns zero.
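A short user-space sketch can tie several of these calls together. The helper names policy_name(), print_sched_info(), and try_set_fifo() are hypothetical and exist only for this illustration; the system calls themselves are the ones listed above. Switching to a real-time policy normally requires privilege, so try_set_fifo() is expected to fail with EPERM for an ordinary user.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <sys/resource.h>

/* Fallbacks for older C libraries; values match the policy numbers in the text. */
#ifndef SCHED_BATCH
#define SCHED_BATCH 3
#endif
#ifndef SCHED_IDLE
#define SCHED_IDLE 5
#endif
#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6
#endif

/* Translates a policy number into a printable name. */
const char *policy_name(int policy)
{
    switch (policy) {
    case SCHED_OTHER:    return "SCHED_OTHER";
    case SCHED_FIFO:     return "SCHED_FIFO";
    case SCHED_RR:       return "SCHED_RR";
    case SCHED_BATCH:    return "SCHED_BATCH";
    case SCHED_IDLE:     return "SCHED_IDLE";
    case SCHED_DEADLINE: return "SCHED_DEADLINE";
    default:             return "unknown";
    }
}

/* Prints the policy, static priority, and nice value of the caller. */
int print_sched_info(void)
{
    struct sched_param sp;
    int policy = sched_getscheduler(0);   /* pid 0 means the caller */
    int nice_val;

    if (policy < 0 || sched_getparam(0, &sp) < 0)
        return -1;

    /* getpriority() may legitimately return -1, so errno must be checked. */
    errno = 0;
    nice_val = getpriority(PRIO_PROCESS, 0);
    if (nice_val == -1 && errno != 0)
        return -1;

    printf("policy=%s static_priority=%d nice=%d\n",
           policy_name(policy), sp.sched_priority, nice_val);
    printf("SCHED_FIFO priority range: %d..%d\n",
           sched_get_priority_min(SCHED_FIFO),
           sched_get_priority_max(SCHED_FIFO));
    return 0;
}

/* Attempts to switch the caller to SCHED_FIFO at the given priority. */
int try_set_fifo(int priority)
{
    struct sched_param sp = { .sched_priority = priority };

    return sched_setscheduler(0, SCHED_FIFO, &sp);
}
```

Querying the range with sched_get_priority_min() and sched_get_priority_max() before calling try_set_fifo() avoids hard-coding the 1 to 99 range in application code.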
Processor affinity calls
Linux-specific processor affinity calls are provided, which help threads define on which CPU(s) they want to run. By default, every thread inherits the processor affinity of its parent, but it can define its own affinity mask to determine its processor affinity. On many-core systems, CPU affinity calls help in enhancing performance, by helping the process stick to one core (the scheduler, however, already attempts to keep a thread on the same CPU). The affinity bitmask information is contained in the cpus_allowed field of struct task_struct. The affinity calls are as follows:
sched_setaffinity(pid_t pid, size_t cpusetsize, const cpu_set_t *mask)
It sets the CPU affinity mask of the thread (pid) to the value specified by mask. If the thread (pid) is not currently running on one of the CPUs specified in mask, it is migrated to one of them. On success, it returns zero.
sched_getaffinity(pid_t pid, size_t cpusetsize, cpu_set_t *mask)
This fetches the affinity mask of the thread (pid) into the cpu_set_t structure of size cpusetsize, pointed to by mask. If the pid is zero, the mask of the calling thread is returned. On success, it returns zero.
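The two affinity calls are typically used together: read the current mask, then narrow it. The sketch below shows this pattern with the glibc CPU_* macros; the helper names first_allowed_cpu() and pin_to_cpu() are hypothetical and chosen only for this illustration.

```c
#define _GNU_SOURCE
#include <sched.h>

/*
 * Returns the lowest-numbered CPU the calling thread may run on,
 * or -1 if the affinity mask could not be read.
 */
int first_allowed_cpu(void)
{
    cpu_set_t mask;
    int cpu;

    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) < 0)   /* pid 0: caller */
        return -1;

    for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &mask))
            return cpu;
    return -1;
}

/*
 * Restricts the calling thread to a single CPU. Returns 0 on success,
 * -1 with errno set on failure (for example, if the CPU is not
 * available to this thread).
 */
int pin_to_cpu(int cpu)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(cpu, &mask);
    return sched_setaffinity(0, sizeof(mask), &mask);
}
```

Choosing the target with first_allowed_cpu() rather than assuming CPU 0 keeps the example working inside containers or cpusets where CPU 0 may not be available to the thread.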
Process preemption
Understanding preemption and context switching is key to fully comprehending scheduling and the impact it has on the kernel in maintaining low latency and consistency. Every process must be preempted either implicitly or explicitly to make way for another process. Preemption might lead to context switching, which requires a low-level architecture-specific operation, carried out by the function context_switch(). There are two primary tasks that need to be done for a processor to switch its context: switch the virtual memory mapping of the old process with the new one, and switch the processor state from that of the old process to the new one. These two tasks are carried out by switch_mm() and switch_to().
Preemption can happen for any of the following reasons:
When a high-priority process becomes runnable. For this, the scheduler will have to periodically check for a high-priority runnable thread. When a reschedule is needed, TIF_NEED_RESCHED (a kernel-provided flag that indicates the need for a reschedule) is set; this flag is checked on return from interrupts and system calls, and the scheduler is invoked if it is set. Since there is a periodic timer interrupt that is guaranteed to occur at regular intervals, invocation of the scheduler is guaranteed. Preemption also happens when a process enters a blocking call or on occurrence of an interrupt event.
The Linux kernel historically has been non-preemptive, which means a task in kernel mode is non-preemptible unless an interrupt event occurs or it chooses to explicitly relinquish the CPU. Since the 2.6 kernel, preemption has been added (it needs to be enabled during kernel build). With kernel preemption enabled, a task in kernel mode is preemptible for all the reasons listed, but a kernel-mode task is allowed to disable kernel preemption while carrying out critical operations. This has been made possible by adding a preemption counter (preempt_count) to each process's thread_info structure. Tasks can disable/enable preemption through the kernel macros preempt_disable() and preempt_enable(), which in turn increment and decrement preempt_count. This ensures that the kernel is preemptible only when preempt_count is zero (indicating no acquired locks).
Critical sections in the kernel code are executed with preemption disabled, which is enforced by invoking preempt_disable() and preempt_enable() within kernel lock operations (for example, spinlocks).
Linux kernels built with the "preempt rt" patch set support a fully preemptible kernel option, which when enabled makes nearly all kernel code, including most critical sections, preemptible.
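The counting scheme behind preempt_disable() and preempt_enable() can be illustrated with a toy user-space model. This is not kernel code: the toy_* names are hypothetical, and the need_resched field merely stands in for the reschedule flag discussed above. The point is that nested critical sections each bump the counter, and a pending reschedule is acted upon only when the outermost section ends.

```c
/* Toy stand-in for the per-task state involved in kernel preemption. */
struct toy_thread_info {
    int preempt_count;   /* nesting depth of preemption-disabled sections */
    int need_resched;    /* set when a reschedule has been requested */
};

/* Enter a critical section: preemption stays off while the count is non-zero. */
void toy_preempt_disable(struct toy_thread_info *ti)
{
    ti->preempt_count++;
}

/*
 * Leave a critical section. Returns 1 if this was the outermost section
 * and a reschedule was pending (the point where the scheduler would be
 * invoked), 0 otherwise.
 */
int toy_preempt_enable(struct toy_thread_info *ti)
{
    ti->preempt_count--;
    if (ti->preempt_count == 0 && ti->need_resched) {
        ti->need_resched = 0;
        return 1;
    }
    return 0;
}

/* The task may be preempted only when no critical section is active. */
int toy_preemptible(const struct toy_thread_info *ti)
{
    return ti->preempt_count == 0;
}
```

Because the state is a counter rather than a boolean, a function that disables preemption can safely call another function that does the same, without the inner toy_preempt_enable() re-enabling preemption too early.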
Summary
Process scheduling is an ever-evolving aspect of the kernel, and as Linux evolves and diversifies further into many computing domains, finer tweaks and changes to the process scheduler will be mandated. However, with the understanding established over this chapter, gaining deeper insights or comprehending any new changes will be quite easy. We are now equipped to go further and explore another important aspect: job control and signal management. We will brush through the basics of signals and move on to the signal management data structures and routines of the kernel.
Signal Management
Signals provide a fundamental infrastructure through which any process can be
notified of a system event asynchronously. They can also be engaged as
communication mechanisms between processes. Understanding how the kernel
provides and manages the entire signal-handling mechanism gives us a firmer
grounding in the kernel. In this chapter, we shall build up our understanding
of signals, from how processes can raise them to how the kernel manages the
routines that make signal events tick. We shall look at the following topics
in detail:
Overview of signals and their types
Process-level signal-management calls
Signal data structures in process descriptors
Kernel's signal generation and delivery mechanisms
Signals
Signals are short messages delivered to a process or a process group. The kernel
uses signals to notify processes about the occurrence of a system event; signals
are also used for communication between processes. Linux categorizes signals
into two groups, namely general-purpose POSIX (classic Unix signals) and
real-time signals. Each group consists of 32 distinct signals, identified by a
unique ID:
#define _NSIG           64
#define _NSIG_BPW       __BITS_PER_LONG
#define _NSIG_WORDS     (_NSIG / _NSIG_BPW)

#define SIGHUP           1
#define SIGINT           2
#define SIGQUIT          3
#define SIGILL           4
#define SIGTRAP          5
#define SIGABRT          6
#define SIGIOT           6
#define SIGBUS           7
#define SIGFPE           8
#define SIGKILL          9
#define SIGUSR1         10
#define SIGSEGV         11
#define SIGUSR2         12
#define SIGPIPE         13
#define SIGALRM         14
#define SIGTERM         15
#define SIGSTKFLT       16
#define SIGCHLD         17
#define SIGCONT         18
#define SIGSTOP         19
#define SIGTSTP         20
#define SIGTTIN         21
#define SIGTTOU         22
#define SIGURG          23
#define SIGXCPU         24
#define SIGXFSZ         25
#define SIGVTALRM       26
#define SIGPROF         27
#define SIGWINCH        28
#define SIGIO           29
#define SIGPOLL         SIGIO
/*
#define SIGLOST         29
*/
#define SIGPWR          30
#define SIGSYS          31
#define SIGUNUSED       31

/* These should not be considered constants from userland. */
#define SIGRTMIN        32
#ifndef SIGRTMAX
#define SIGRTMAX        _NSIG
#endif
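The _NSIG, _NSIG_BPW, and _NSIG_WORDS macros size the bitmask that records a set of signals: one bit per signal, packed into unsigned long words. The following is a minimal user-space sketch of that layout; the MODEL_* macros, struct model_sigset, and the model_* functions are hypothetical names chosen for this illustration, not kernel identifiers.

```c
#include <limits.h>

#define MODEL_NSIG       64
#define MODEL_NSIG_BPW   (sizeof(unsigned long) * CHAR_BIT)
#define MODEL_NSIG_WORDS (MODEL_NSIG / MODEL_NSIG_BPW)

/* Hypothetical stand-in for a signal set: one bit per signal. */
struct model_sigset {
    unsigned long sig[MODEL_NSIG_WORDS];
};

/* Signal numbers start at 1, so signal n is stored in bit n - 1. */
void model_sigaddset(struct model_sigset *set, int signo)
{
    unsigned int bit = (unsigned int)signo - 1;

    set->sig[bit / MODEL_NSIG_BPW] |= 1UL << (bit % MODEL_NSIG_BPW);
}

/* Returns 1 if signo is a member of the set, 0 otherwise. */
int model_sigismember(const struct model_sigset *set, int signo)
{
    unsigned int bit = (unsigned int)signo - 1;

    return (int)((set->sig[bit / MODEL_NSIG_BPW] >> (bit % MODEL_NSIG_BPW)) & 1UL);
}
```

On a 64-bit build the whole set fits in a single word; on a 32-bit build the same code spreads it across two.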
Signals in the general-purpose category are bound to a specific system event and
are named appropriately through macros. Those in the real-time category aren't
bound to a specific event, and are free for applications to engage for process
communication; the kernel refers to them with generic names: SIGRTMIN and
SIGRTMAX.
Upon generation of a signal, the kernel delivers the signal event to the
destination process, which in turn can respond to the signal as per the
configured action, called signal disposition.
The following is the list of actions that a process can set up as its signal
disposition. A process can set up any one of the actions as its signal
disposition at a point in time, but it can switch between these actions any
number of times without any restrictions.
Kernel handler: The kernel implements a default handler for each signal.
These handlers are available to a process through the signal handler table of
its task structure. Upon reception of a signal, a process can request
execution of the appropriate signal handler. This is the default disposition.
Process-defined handler: A process is allowed to implement its own signal
handlers, and set them up to be executed in response to a signal event. This
is made possible through the appropriate system call interface, which
allows the process to bind its handler routine with a signal. On occurrence
of a signal, the process handler would be invoked asynchronously.
Ignore: A process is also allowed to ignore the occurrence of a signal, but it
needs to announce its intent to ignore by invoking the appropriate system
call.
Kernel-defined default handler routines can execute any of the following actions:
Ignore: Nothing happens.
Terminate: Kill the process, that is, all threads in the group (similar to
exit_group). The group leader (only) reports the WIFSIGNALED status to its parent.
Coredump: Write a core dump file describing all threads using the same mm
and then kill all those threads.
Stop: Stop all the threads in the group, that is, put them into the
TASK_STOPPED state.
Following is the summarized table that lists the actions executed by default
handlers:
+--------------------+------------------+
| POSIX signal       | default action   |
+--------------------+------------------+
| SIGHUP             | terminate        |
| SIGINT             | terminate        |
| SIGQUIT            | coredump         |
| SIGILL             | coredump         |
| SIGTRAP            | coredump         |
| SIGABRT/SIGIOT     | coredump         |
| SIGBUS             | coredump         |
| SIGFPE             | coredump         |
| SIGKILL            | terminate        |
| SIGUSR1            | terminate        |
| SIGSEGV            | coredump         |
| SIGUSR2            | terminate        |
| SIGPIPE            | terminate        |
| SIGALRM            | terminate        |
| SIGTERM            | terminate        |
| SIGCHLD            | ignore           |
| SIGCONT            | ignore           |
| SIGSTOP            | stop             |
| SIGTSTP            | stop             |
| SIGTTIN            | stop             |
| SIGTTOU            | stop             |
| SIGURG             | ignore           |
| SIGXCPU            | coredump         |
| SIGXFSZ            | coredump         |
| SIGVTALRM          | terminate        |
| SIGPROF            | terminate        |
| SIGPOLL/SIGIO      | terminate        |
| SIGSYS/SIGUNUSED   | coredump         |
| SIGSTKFLT          | terminate        |
| SIGWINCH           | ignore           |
| SIGPWR             | terminate        |
| SIGRTMIN-SIGRTMAX  | terminate        |
+--------------------+------------------+
| non-POSIX signal   | default action   |
+--------------------+------------------+
| SIGEMT             | coredump         |
+--------------------+------------------+
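The terminate and ignore rows of the table can be observed from user space. The sketch below forks a child that restores the default disposition, raises the signal against itself, and lets the parent inspect the wait status. The function name default_action_result() is a hypothetical helper; coredump signals are deliberately left out of the usage so that no core files are produced.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child that raises signo with the default disposition in place.
 * Returns the number of the signal that killed the child, 0 if the child
 * survived the signal and exited normally, or -1 on error.
 */
int default_action_result(int signo)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;

    if (pid == 0) {
        sigset_t none;

        signal(signo, SIG_DFL);                 /* default disposition */
        sigemptyset(&none);
        sigprocmask(SIG_SETMASK, &none, NULL);  /* nothing blocked */
        raise(signo);
        _exit(0);   /* reached only if the default action did not kill us */
    }

    if (waitpid(pid, &status, 0) != pid)
        return -1;

    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling default_action_result(SIGTERM) is expected to report SIGTERM as the cause of death, while a signal whose default action is ignore, such as SIGCHLD or SIGURG, lets the child run to its normal exit.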
#include <signal.h>
int sigaction(int signum, const struct sigaction *act, struct sigaction *oldact);

The sigaction structure is defined as something like:

struct sigaction {
    void (*sa_handler)(int);
    void (*sa_sigaction)(int, siginfo_t *, void *);
    sigset_t sa_mask;
    int sa_flags;
    void (*sa_restorer)(void);
};
void handler_fn(int signo, siginfo_t *info, void *context);
siginfo_t {
    int si_signo;           /* Signal number */
    int si_errno;           /* An errno value */
    int si_code;            /* Signal code */
    int si_trapno;          /* Trap number that caused hardware-generated signal
                               (unused on most architectures) */
    pid_t si_pid;           /* Sending process ID */
    uid_t si_uid;           /* Real user ID of sending process */
    int si_status;          /* Exit value or signal */
    clock_t si_utime;       /* User time consumed */
    clock_t si_stime;       /* System time consumed */
    sigval_t si_value;      /* Signal value */
    int si_int;             /* POSIX.1b signal */
    void *si_ptr;           /* POSIX.1b signal */
    int si_overrun;         /* Timer overrun count; POSIX.1b timers */
    int si_timerid;         /* Timer ID; POSIX.1b timers */
    void *si_addr;          /* Memory location which caused fault */
    long si_band;           /* Band event (was int in glibc 2.3.2 and earlier) */
    int si_fd;              /* File descriptor */
    short si_addr_lsb;      /* Least significant bit of address
                               (since Linux 2.6.32) */
    void *si_call_addr;     /* Address of system call instruction
                               (since Linux 3.5) */
    int si_syscall;         /* Number of attempted system call
                               (since Linux 3.5) */
    unsigned int si_arch;   /* Architecture of attempted system call
                               (since Linux 3.5) */
}
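To make the role of siginfo_t concrete, the following sketch installs a three-argument handler with SA_SIGINFO and records a few of the fields the kernel fills in. The names record_info(), send_self_and_record(), and the recorded_*() accessors are hypothetical helpers invented for this example, and the sketch assumes a single-threaded caller that does not have the signal blocked.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

static volatile sig_atomic_t seen_signo;
static volatile sig_atomic_t seen_code;
static volatile sig_atomic_t seen_pid;

/* SA_SIGINFO-style handler: the kernel passes a filled-in siginfo_t. */
static void record_info(int signo, siginfo_t *info, void *context)
{
    (void)context;
    seen_signo = signo;
    seen_code = info->si_code;
    seen_pid = (sig_atomic_t)info->si_pid;
}

/* Install the handler for signo and send the signal to ourselves. */
int send_self_and_record(int signo)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_sigaction = record_info;
    sa.sa_flags = SA_SIGINFO;   /* select sa_sigaction over sa_handler */
    sigemptyset(&sa.sa_mask);

    if (sigaction(signo, &sa, NULL) < 0)
        return -1;

    return kill(getpid(), signo);
}

int recorded_signo(void) { return seen_signo; }
int recorded_code(void)  { return seen_code; }
long recorded_pid(void)  { return seen_pid; }
```

After send_self_and_record(SIGUSR1) returns, the recorded si_code is expected to be SI_USER (the signal came from kill()) and the recorded si_pid to match the caller's own PID.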
int sigprocmask(int how, const sigset_t *set, sigset_t *oldset);
int sigpending(sigset_t *set);
The operations are applicable for all signals except SIGKILL and
SIGSTOP; in other words, processes are not allowed to alter the default
disposition or block SIGSTOP and SIGKILL signals.
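The restriction on SIGKILL and SIGSTOP can be checked with a small probe. In this sketch, can_ignore() and can_block() are hypothetical helper names; each one attempts the operation, reports whether it took effect, and then restores the previous state.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Returns 1 if the kernel accepts SIG_IGN for signo, 0 if it refuses. */
int can_ignore(int signo)
{
    struct sigaction ign, old;

    memset(&ign, 0, sizeof(ign));
    ign.sa_handler = SIG_IGN;
    sigemptyset(&ign.sa_mask);

    if (sigaction(signo, &ign, &old) < 0)
        return 0;                   /* refused, typically with EINVAL */

    sigaction(signo, &old, NULL);   /* put the previous disposition back */
    return 1;
}

/* Returns 1 if signo shows up in the blocked mask after asking to block it. */
int can_block(int signo)
{
    sigset_t want, old, now;
    int blocked;

    sigemptyset(&want);
    sigaddset(&want, signo);
    if (sigprocmask(SIG_BLOCK, &want, &old) < 0)
        return 0;

    sigprocmask(SIG_BLOCK, NULL, &now);     /* query the current mask */
    blocked = sigismember(&now, signo);
    sigprocmask(SIG_SETMASK, &old, NULL);   /* restore the original mask */

    return blocked == 1;
}
```

Note the asymmetry: sigaction() fails outright for SIGKILL and SIGSTOP, whereas sigprocmask() succeeds but silently leaves those two signals unblocked.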
int kill(pid_t pid, int sig);
int sigqueue(pid_t pid, int sig, const union sigval value);

union sigval {
    int sival_int;
    void *sival_ptr;
};

/* queue signal to specific thread in a thread group */
int tgkill(int tgid, int tid, int sig);

/* queue signal and data to a thread group */
int rt_sigqueueinfo(pid_t tgid, int sig, siginfo_t *uinfo);

/* queue signal and data to specific thread in a thread group */
int rt_tgsigqueueinfo(pid_t tgid, pid_t tid, int sig, siginfo_t *uinfo);
Waiting for queued signals
When applying signals for process communication, it might be more appropriate
for a process to suspend itself until the occurrence of a specific signal, and
resume execution on the arrival of a signal from another process. The POSIX
calls sigsuspend(), sigwaitinfo(), and sigtimedwait() provide this functionality:
int sigsuspend(const sigset_t *mask);
int sigwaitinfo(const sigset_t *set, siginfo_t *info);
int sigtimedwait(const sigset_t *set, siginfo_t *info, const struct timespec *timeout);
While all of these APIs allow a process to wait for a specified signal to occur,
sigwaitinfo() provides additional data about the signal through the siginfo_t
instance returned through the info pointer. sigtimedwait() extends the
functionality by providing an additional argument that allows the operation to
time out, making it a bounded wait call. The Linux kernel provides an alternate
API that allows the process to be notified about the occurrence of a signal
through a special file descriptor, created by calling signalfd():
#include <sys/signalfd.h>
int signalfd(int fd, const sigset_t *mask, int flags);
On success, signalfd() returns a file descriptor, on which the process needs to
invoke read(), which blocks until any of the signals specified in the mask occur.
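As an illustration of the descriptor-based interface, the sketch below queues a signal carrying an integer payload to the calling process with sigqueue() and reads it back through signalfd(). The function name roundtrip_through_signalfd() is a hypothetical helper; the signal is blocked first so that it stays pending for read() instead of triggering its disposition, and a payload of -1 is best avoided because it collides with the error return.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/signalfd.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Queue signo with an integer payload to ourselves and read it back
 * through a signalfd descriptor. Returns the payload, or -1 on error.
 */
int roundtrip_through_signalfd(int signo, int payload)
{
    sigset_t mask, old;
    struct signalfd_siginfo si;
    union sigval value;
    int fd, result = -1;

    sigemptyset(&mask);
    sigaddset(&mask, signo);

    /* Block first so the signal remains pending until it is read. */
    if (sigprocmask(SIG_BLOCK, &mask, &old) < 0)
        return -1;

    fd = signalfd(-1, &mask, 0);    /* -1 asks for a new descriptor */
    if (fd >= 0) {
        value.sival_int = payload;
        if (sigqueue(getpid(), signo, value) == 0 &&
            read(fd, &si, sizeof(si)) == (ssize_t)sizeof(si) &&
            si.ssi_signo == (unsigned int)signo)
            result = si.ssi_int;    /* payload travels in ssi_int */
        close(fd);
    }

    sigprocmask(SIG_SETMASK, &old, NULL);
    return result;
}
```

Reading from the descriptor consumes the pending signal, so nothing is left to be delivered when the original mask is restored.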
struct task_struct {

    ....
    ....
    ....
    /* signal handlers */
    struct signal_struct *signal;
    struct sighand_struct *sighand;

    sigset_t blocked, real_blocked;
    sigset_t saved_sigmask; /* restored if set_restore_sigmask() was used */
    struct sigpending pending;

    unsigned long sas_ss_sp;
    size_t sas_ss_size;
    unsigned sas_ss_flags;
    ....
    ....
    ....
    ....

};
Signal descriptors
Recall from our earlier discussions in the first chapter that Linux supports
multi-threaded applications through lightweight processes. All LWPs of a
threaded application are part of a thread group and share signal handlers; each
LWP (thread) maintains its own pending and blocked signal queues.
The signal pointer of the task structure refers to the instance of type
signal_struct, which is the signal descriptor. This structure is shared by all
LWPs of a thread group and maintains elements such as a shared pending signal
queue (for signals queued to a thread group), which is common to all threads in
the thread group.
The following figure represents the data structures involved in maintaining
shared pending signals:
Following are a few important fields of signal_struct:
struct signal_struct {
    atomic_t sigcnt;
    atomic_t live;
    int nr_threads;
    struct list_head thread_head;

    wait_queue_head_t wait_chldexit; /* for wait4() */

    /* current thread group signal load-balancing target: */
    struct task_struct *curr_target;

    /* shared signal handling: */
    struct sigpending shared_pending;

    /* thread group exit support */
    int group_exit_code;
    /* overloaded:
     * - notify group_exit_task when ->count is equal to notify_count
     * - everyone except group_exit_task is stopped during signal delivery
     *   of fatal signals, group_exit_task processes the signal.
     */
    int notify_count;
    struct task_struct *group_exit_task;

    /* thread group stop support, overloads group_exit_code too */
    int group_stop_count;
    unsigned int flags; /* see SIGNAL_* flags below */
    ....
};
Blocked and pending queues
The blocked and real_blocked instances in the task structure are bitmasks of
blocked signals; these masks are per-process. Each LWP in a thread group thus
has its own blocked signal mask. The pending instance of the task structure is
used to queue private pending signals; all signals queued to a normal process
and to a specific LWP in a thread group are queued into this list:
struct sigpending {
    struct list_head list; // head of doubly linked list of struct sigqueue
    sigset_t signal;       // bitmask of pending signals
};
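The split between a bitmask and a list of queued entries has a visible consequence in user space: a classic signal that is already pending is not queued a second time, whereas each instance of a real-time signal gets its own queue entry. The sketch below demonstrates this by blocking a signal, raising it several times (raise() targets the calling thread, so the instances land on its private pending queue), and counting how many can be dequeued. count_pending_instances() is a hypothetical helper name, and the behavior described is that of Linux as the author of this sketch understands it.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <time.h>

/*
 * Block signo, raise it `times` times, then count how many instances
 * can be dequeued. Returns the count, or -1 on error.
 */
int count_pending_instances(int signo, int times)
{
    sigset_t mask, old;
    struct timespec zero = { 0, 0 };
    int i, count = 0;

    sigemptyset(&mask);
    sigaddset(&mask, signo);
    if (sigprocmask(SIG_BLOCK, &mask, &old) < 0)
        return -1;

    for (i = 0; i < times; i++) {
        if (raise(signo) != 0) {
            sigprocmask(SIG_SETMASK, &old, NULL);
            return -1;
        }
    }

    /* A zero timeout turns sigtimedwait() into a poll of the pending set. */
    while (sigtimedwait(&mask, NULL, &zero) == signo)
        count++;

    sigprocmask(SIG_SETMASK, &old, NULL);
    return count;
}
```

Raising SIGUSR1 three times while it is blocked is expected to leave a single pending instance, while raising SIGRTMIN three times is expected to leave three.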
The following figure represents the data structures involved in maintaining
private pending signals:
Signal handler descriptor
The sighand pointer of the task structure refers to an instance of the struct
sighand_struct, which is the signal handler descriptor shared by all processes
in a thread group. This structure is also shared by all processes created using
clone() with the CLONE_SIGHAND flag. This structure holds an array of
k_sigaction instances, each wrapping an instance of sigaction that describes the
current disposition of each signal:
struct k_sigaction {
    struct sigaction sa;
#ifdef __ARCH_HAS_KA_RESTORER
    __sigrestore_t ka_restorer;
#endif
};

struct sighand_struct {
    atomic_t count;
    struct k_sigaction action[_NSIG];
    spinlock_t siglock;
    wait_queue_head_t signalfd_wqh;
};
The following figure represents the signal handler descriptor:
Signal generation and delivery
A signal is said to be generated when its occurrence is enqueued to the list of
pending signals in the task structure of the receiver process or processes. The
signal is generated (on a process or a group) upon request from a user-mode
process, the kernel, or any of the kernel services. A signal is considered to be
delivered when the receiver process or processes are made aware of its
occurrence and are forced to execute the appropriate response handler; in other
words, signal delivery is equal to initiation of the corresponding handler.
Ideally, every signal generated is assumed to be instantly delivered; however,
there is a possibility of delay between signal generation and its eventual
delivery. To facilitate possible deferred delivery, the kernel provides separate
functions for signal generation and delivery.
static int send_signal(int sig, struct siginfo *info, struct task_struct *t,
                       int group)
{
    int from_ancestor_ns = 0;

#ifdef CONFIG_PID_NS
    from_ancestor_ns = si_fromuser(info) &&
                       !task_pid_nr_ns(current, task_active_pid_ns(t));
#endif

    return __send_signal(sig, info, t, group, from_ancestor_ns);
}
/*
 * fast-pathed signals for kernel-internal things like SIGSTOP
 * or SIGKILL.
 */
if (info == SEND_SIG_FORCED)
    goto out_set;
....
....
....
out_set:
    signalfd_notify(t, sig);
    sigaddset(&pending->signal, sig);
    complete_signal(sig, t, group);

q = __sigqueue_alloc(sig, t, GFP_ATOMIC | __GFP_NOTRACK_FALSE_POSITIVE,
                     override_rlimit);

if (q) {
    list_add_tail(&q->list, &pending->list);
    switch ((unsigned long) info) {
    case (unsigned long) SEND_SIG_NOINFO:
        q->info.si_signo = sig;
        q->info.si_errno = 0;
        q->info.si_code = SI_USER;
        q->info.si_pid = task_tgid_nr_ns(current,
                                         task_active_pid_ns(t));
        q->info.si_uid = from_kuid_munged(current_user_ns(), current_uid());
        break;
    case (unsigned long) SEND_SIG_PRIV:
        q->info.si_signo = sig;
        q->info.si_errno = 0;
        q->info.si_code = SI_KERNEL;
        q->info.si_pid = 0;
        q->info.si_uid = 0;
        break;
    default:
        copy_siginfo(&q->info, info);
        if (from_ancestor_ns)
            q->info.si_pid = 0;
        break;
    }


sigaddset(&pending->signal, sig);
complete_signal(sig, t, group);
Signal delivery
After a signal is generated by updating appropriate entries in the receiver's
task structure, through any of the previously mentioned signal-generation calls,
the kernel moves into delivery mode. The signal is instantly delivered if the
receiver process was on CPU and has not blocked the specified signal. Priority
signals SIGSTOP and SIGKILL are delivered even if the receiver is not on CPU by
waking up the process; however, for the rest of the signals, delivery is
deferred until the process is ready to receive signals. To facilitate deferred
delivery, the kernel checks for non-blocked pending signals of a process on
return from interrupts and system calls before allowing a process to resume
user-mode execution. When the kernel's return path to user mode (taken on return
from interrupts and exceptions) finds the TIF_SIGPENDING flag set, it invokes
the kernel function do_signal() to initiate delivery of the pending signal
before resuming the user-mode context of the process.
Upon entry into kernel mode, the user-mode register state of the process is
stored in the process kernel stack in a structure called pt_regs (architecture
specific):
struct pt_regs {
/*
 * C ABI says these regs are callee-preserved. They aren't saved on kernel entry
 * unless syscall needs a complete, fully filled "struct pt_regs".
 */
    unsigned long r15;
    unsigned long r14;
    unsigned long r13;
    unsigned long r12;
    unsigned long rbp;
    unsigned long rbx;
/* These regs are callee-clobbered. Always saved on kernel entry. */
    unsigned long r11;
    unsigned long r10;
    unsigned long r9;
    unsigned long r8;
    unsigned long rax;
    unsigned long rcx;
    unsigned long rdx;
    unsigned long rsi;
    unsigned long rdi;
/*
 * On syscall entry, this is syscall#. On CPU exception, this is error code.
 * On hw interrupt, it's IRQ number:
 */
    unsigned long orig_rax;
/* Return frame for iretq */
    unsigned long rip;
    unsigned long cs;
    unsigned long eflags;
    unsigned long rsp;
    unsigned long ss;
/* top of stack page */
};
The do_signal() routine is invoked with the address of pt_regs in the kernel
stack. Though do_signal() is meant to deliver non-blocked pending signals, its
implementation is architecture specific.
Following is the x86 version of do_signal():
void do_signal(struct pt_regs *regs)
{
    struct ksignal ksig;

    if (get_signal(&ksig)) {
        /* Whee! Actually deliver the signal. */
        handle_signal(&ksig, regs);
        return;
    }

    /* Did we come from a system call? */
    if (syscall_get_nr(current, regs) >= 0) {
        /* Restart the system call - no handlers present */
        switch (syscall_get_error(current, regs)) {
        case -ERESTARTNOHAND:
        case -ERESTARTSYS:
        case -ERESTARTNOINTR:
            regs->ax = regs->orig_ax;
            regs->ip -= 2;
            break;

        case -ERESTART_RESTARTBLOCK:
            regs->ax = get_nr_restart_syscall(regs);
            regs->ip -= 2;
            break;
        }
    }

    /*
     * If there's no signal to deliver, we just put the saved sigmask
     * back.
     */
    restore_saved_sigmask();
}
do_signal() invokes the get_signal() function with the address of an instance of
type struct ksignal (we shall briefly consider important steps of this routine,
skipping other details). This function contains a loop that invokes
dequeue_signal() until all non-blocked pending signals from both private and
shared pending lists are dequeued. It begins with a lookup in the private
pending signal queue, starting from the lowest-numbered signal, and follows into
pending signals in the shared queue, and then updates the data structures to
indicate that the signal is no longer pending and returns its number:
signr = dequeue_signal(current, &current->blocked, &ksig->info);
For each pending signal returned by dequeue_signal(), get_signal() retrieves the
current signal disposition through a pointer of type struct k_sigaction *ka:
ka = &sighand->action[signr-1];
If the signal disposition is set to SIG_IGN, it silently ignores the current
signal and continues iteration to retrieve another pending signal:
if (ka->sa.sa_handler == SIG_IGN) /* Do nothing. */
    continue;
If the disposition is not equal to SIG_DFL, it copies the k_sigaction of the
signal into ksig->ka for further execution of the user-mode handler. It further
checks for the SA_ONESHOT (SA_RESETHAND) flag in the user's sigaction and, if
set, resets the signal disposition to SIG_DFL. It then breaks out of the loop
and returns to the caller. do_signal() now invokes the handle_signal() routine
to execute the user-mode handler (we shall discuss this in detail in the next
section).
if (ka->sa.sa_handler != SIG_DFL) {
    /* Run the handler. */
    ksig->ka = *ka;

    if (ka->sa.sa_flags & SA_ONESHOT)
        ka->sa.sa_handler = SIG_DFL;

    break; /* will return non-zero "signr" value */
}
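The one-shot reset performed in this branch is observable from user space: after a handler installed with SA_RESETHAND has run once, querying the disposition reports SIG_DFL. The following is a minimal sketch; oneshot_handler() and oneshot_resets_to_default() are hypothetical names, and a single-threaded caller with the signal unblocked is assumed.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

static volatile sig_atomic_t oneshot_runs;

static void oneshot_handler(int signo)
{
    (void)signo;
    oneshot_runs++;
}

/*
 * Install a one-shot handler for signo, deliver the signal once, and
 * report whether the disposition fell back to SIG_DFL afterwards.
 * Returns 1 if it did, 0 if it did not, -1 on error.
 */
int oneshot_resets_to_default(int signo)
{
    struct sigaction sa, now;

    oneshot_runs = 0;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = oneshot_handler;
    sa.sa_flags = SA_RESETHAND;     /* user-space name for SA_ONESHOT */
    sigemptyset(&sa.sa_mask);

    if (sigaction(signo, &sa, NULL) < 0)
        return -1;
    if (raise(signo) != 0)          /* handler runs before raise() returns */
        return -1;
    if (sigaction(signo, NULL, &now) < 0)   /* query only */
        return -1;

    return oneshot_runs == 1 && now.sa_handler == SIG_DFL;
}
```

A second occurrence of the same signal would therefore take the kernel's default action unless the handler is installed again.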
If the disposition is set to SIG_DFL, it invokes a set of macros to check for
the default action of the kernel handler. Possible default actions are:
Term: Default action is to terminate the process
Ign: Default action is to ignore the signal
Core: Default action is to terminate the process and dump core
Stop: Default action is to stop the process
Cont: Default action is to continue the process if it is currently stopped
Following is a code snippet from get_signal() that initiates the default action
as per the set disposition:
    /*
     * Now we are doing the default action for this signal.
     */
    if (sig_kernel_ignore(signr)) /* Default is nothing. */
        continue;

    /*
     * Global init gets no signals it doesn't want.
     * Container-init gets no signals it doesn't want from same
     * container.
     *
     * Note that if global/container-init sees a sig_kernel_only()
     * signal here, the signal must have been generated internally
     * or must have come from an ancestor namespace. In either
     * case, the signal cannot be dropped.
     */
    if (unlikely(signal->flags & SIGNAL_UNKILLABLE) &&
        !sig_kernel_only(signr))
        continue;

    if (sig_kernel_stop(signr)) {
        /*
         * The default action is to stop all threads in
         * the thread group. The job control signals
         * do nothing in an orphaned pgrp, but SIGSTOP
         * always works. Note that siglock needs to be
         * dropped during the call to is_orphaned_pgrp()
         * because of lock ordering with tasklist_lock.
         * This allows an intervening SIGCONT to be posted.
         * We need to check for that and bail out if necessary.
         */
        if (signr != SIGSTOP) {
            spin_unlock_irq(&sighand->siglock);

            /* signals can be posted during this window */

            if (is_current_pgrp_orphaned())
                goto relock;

            spin_lock_irq(&sighand->siglock);
        }

        if (likely(do_signal_stop(ksig->info.si_signo))) {
            /* It released the siglock. */
            goto relock;
        }

        /*
         * We didn't actually stop, due to a race
         * with SIGCONT or something like that.
         */
        continue;
    }

    spin_unlock_irq(&sighand->siglock);

    /*
     * Anything else is fatal, maybe with a core dump.
     */
    current->flags |= PF_SIGNALED;

    if (sig_kernel_coredump(signr)) {
        if (print_fatal_signals)
            print_fatal_signal(ksig->info.si_signo);
        proc_coredump_connector(current);
        /*
         * If it was able to dump core, this kills all
         * other threads in the group and synchronizes with
         * their demise. If we lost the race with another
         * thread getting here, it set group_exit_code
         * first and our do_group_exit call below will use
         * that value and ignore the one we pass it.
         */
        do_coredump(&ksig->info);
    }

    /*
     * Death signals, no core dump.
     */
    do_group_exit(ksig->info.si_signo);
    /* NOTREACHED */
}
First, the macro sig_kernel_ignore checks for the default action ignore. If
true, it continues loop iteration to look for the next pending signal. The
second macro, sig_kernel_stop, checks for the default action stop; if true, it
invokes the do_signal_stop() routine, which puts each thread in the thread group
into the TASK_STOPPED state. The third macro, sig_kernel_coredump, checks for
the default action dump; if true, it invokes the do_coredump() routine, which
generates the core dump binary file and terminates all the processes in the
thread group. Next, for signals with default action terminate, all threads in
the group are killed by invoking the do_group_exit() routine.
Executing user-mode handlers
Recall from our discussion in the previous section that do_signal() invokes the
handle_signal() routine for delivery of pending signals whose disposition is set
to user handler. The user-mode signal handler resides in the process code
segment and requires access to the user-mode stack of the process; therefore,
the kernel needs to switch to the user-mode stack for executing the signal
handler. Successful return from the signal handler requires a switch back to the
kernel stack to restore the user context for normal user-mode execution, but
such an operation would fail since the kernel stack would no longer contain the
user context (struct pt_regs), because it is emptied on each entry of the
process from user to kernel mode.
To ensure smooth transition of the process for its normal execution in user mode
(on return from the signal handler), handle_signal() moves the user-mode
hardware context (struct pt_regs) in the kernel stack into the user-mode stack
(struct ucontext) and sets up the handler frame to invoke the
__kernel_rt_sigreturn() routine during return; this function copies the
hardware context back into the kernel stack and restores the user-mode context
for resuming normal execution of the current process.
The following figure depicts the execution of a user-mode signal handler:
Setting up user-mode handler frames
To set up a stack frame for a user-mode handler, handle_signal() invokes
setup_rt_frame() with the address of the instance of ksignal, which contains the
k_sigaction associated with the signal, and the pointer to struct pt_regs in the
kernel stack of the current process.
Following is the x86 implementation of setup_rt_frame():
static int
setup_rt_frame(struct ksignal *ksig, struct pt_regs *regs)
{
    int usig = ksig->sig;
    sigset_t *set = sigmask_to_save();
    compat_sigset_t *cset = (compat_sigset_t *) set;

    /* Set up the stack frame */
    if (is_ia32_frame(ksig)) {
        if (ksig->ka.sa.sa_flags & SA_SIGINFO)
            return ia32_setup_rt_frame(usig, ksig, cset, regs); // for 32-bit systems with SA_SIGINFO
        else
            return ia32_setup_frame(usig, ksig, cset, regs); // for 32-bit systems without SA_SIGINFO
    } else if (is_x32_frame(ksig)) {
        return x32_setup_rt_frame(ksig, cset, regs); // for systems with x32 ABI
    } else {
        return __setup_rt_frame(ksig->sig, ksig, set, regs); // other variants of x86
    }
}
It checks for the specific variant of x86 and invokes the appropriate frame
setup routine. For further discussion, we shall focus on __setup_rt_frame(),
which applies for x86-64. This function populates an instance of a structure
called struct rt_sigframe with information needed to handle the signal, sets up
a return path (through the __kernel_rt_sigreturn() function), and pushes it into
the user-mode stack:
/* arch/x86/include/asm/sigframe.h */
#ifdef CONFIG_X86_64

struct rt_sigframe {
    char __user *pretcode;
    struct ucontext uc;
    struct siginfo info;
    /* fp state follows here */
};

-----------------------

/* arch/x86/kernel/signal.c */
static int __setup_rt_frame(int sig, struct ksignal *ksig,
                            sigset_t *set, struct pt_regs *regs)
{
    struct rt_sigframe __user *frame;
    void __user *restorer;
    int err = 0;
    void __user *fpstate = NULL;

    /* set up frame with floating point state */
    frame = get_sigframe(&ksig->ka, regs, sizeof(*frame), &fpstate);

    if (!access_ok(VERIFY_WRITE, frame, sizeof(*frame)))
        return -EFAULT;

    put_user_try {
        put_user_ex(sig, &frame->sig);
        put_user_ex(&frame->info, &frame->pinfo);
        put_user_ex(&frame->uc, &frame->puc);

        /* Create the ucontext. */
        if (boot_cpu_has(X86_FEATURE_XSAVE))
            put_user_ex(UC_FP_XSTATE, &frame->uc.uc_flags);
        else
            put_user_ex(0, &frame->uc.uc_flags);
        put_user_ex(0, &frame->uc.uc_link);
        save_altstack_ex(&frame->uc.uc_stack, regs->sp);

        /* Set up to return from userspace. */
        restorer = current->mm->context.vdso +
                   vdso_image_32.sym___kernel_rt_sigreturn;
        if (ksig->ka.sa.sa_flags & SA_RESTORER)
            restorer = ksig->ka.sa.sa_restorer;
        put_user_ex(restorer, &frame->pretcode);

        /*
         * This is movl $__NR_rt_sigreturn, %ax ; int $0x80
         *
         * WE DO NOT USE IT ANY MORE! It's only left here for historical
         * reasons and because gdb uses it as a signature to notice
         * signal handler stack frames.
         */
        put_user_ex(*((u64 *)&rt_retcode), (u64 *)frame->retcode);
    } put_user_catch(err);

    err |= copy_siginfo_to_user(&frame->info, &ksig->info);
    err |= setup_sigcontext(&frame->uc.uc_mcontext, fpstate,
                            regs, set->sig[0]);
    err |= __copy_to_user(&frame->uc.uc_sigmask, set, sizeof(*set));

    if (err)
        return -EFAULT;

    /* Set up registers for signal handler */
    regs->sp = (unsigned long)frame;
    regs->ip = (unsigned long)ksig->ka.sa.sa_handler;
    regs->ax = (unsigned long)sig;
    regs->dx = (unsigned long)&frame->info;
    regs->cx = (unsigned long)&frame->uc;

    regs->ds = __USER_DS;
    regs->es = __USER_DS;
    regs->ss = __USER_DS;
    regs->cs = __USER_CS;

    return 0;
}
The *pretcode field of the rt_sigframe structure is assigned the return address
of the signal-handler function, which is the __kernel_rt_sigreturn() routine.
struct ucontext uc is initialized with sigcontext, which contains the user-mode
context copied from pt_regs of the kernel stack, the bit array of regular
blocked signals, and the floating point state. After setting up and pushing the
frame instance to the user-mode stack, __setup_rt_frame() alters pt_regs of the
process in the kernel stack to hand over control to the signal handler when the
current process resumes execution. The instruction pointer (ip) is set to the
base address of the signal handler and the stack pointer (sp) is set to the top
address of the frame pushed earlier; these changes cause the signal handler to
execute.
Restarting interrupted system calls
We understood in Chapter 1, Comprehending Processes, Address Space, and
Threads, that user-mode processes invoke system calls to switch into kernel mode
for executing kernel services. When a process enters a kernel service routine,
there is a possibility of the routine being blocked for availability of
resources (for example, wait on exclusion lock) or occurrence of an event (such
as interrupts). Such blocking operations require the caller process to be put
into the TASK_INTERRUPTIBLE, TASK_UNINTERRUPTIBLE, or TASK_KILLABLE state. The
specific state effected depends on the choice of blocking call invoked in the
system calls.
If the caller task is put into the TASK_UNINTERRUPTIBLE state, occurrences of
signals on that task are generated, causing them to enter the pending list, and
are delivered to the process only after completion of the service routine (on
its return path to user mode). However, if the task was put into the
TASK_INTERRUPTIBLE state, occurrences of signals on that task are generated and
an immediate delivery is attempted by altering its state to TASK_RUNNING, which
causes the task to wake up on a blocked system call even before the system call
is completed (causing the system call operation to fail). Such interruptions are
indicated by returning the appropriate failure code. The effect of signals on a
task in the TASK_KILLABLE state is similar to TASK_INTERRUPTIBLE, except that
wake-up is only effected on occurrence of the fatal SIGKILL signal.
EINTR, ERESTARTNOHAND, ERESTART_RESTARTBLOCK, ERESTARTSYS, and ERESTARTNOINTR are
various kernel-defined failure codes; system calls are programmed to return
appropriate error flags on failure. The choice of error code determines whether
failed system call operations are restarted after the interrupting signal is
handled:
(include/uapi/asm-generic/errno-base.h)
#define EPERM            1      /* Operation not permitted */
#define ENOENT           2      /* No such file or directory */
#define ESRCH            3      /* No such process */
#define EINTR            4      /* Interrupted system call */
#define EIO              5      /* I/O error */
#define ENXIO            6      /* No such device or address */
#define E2BIG            7      /* Argument list too long */
#define ENOEXEC          8      /* Exec format error */
#define EBADF            9      /* Bad file number */
#define ECHILD          10      /* No child processes */
#define EAGAIN          11      /* Try again */
#define ENOMEM          12      /* Out of memory */
#define EACCES          13      /* Permission denied */
#define EFAULT          14      /* Bad address */
#define ENOTBLK         15      /* Block device required */
#define EBUSY           16      /* Device or resource busy */
#define EEXIST          17      /* File exists */
#define EXDEV           18      /* Cross-device link */
#define ENODEV          19      /* No such device */
#define ENOTDIR         20      /* Not a directory */
#define EISDIR          21      /* Is a directory */
#define EINVAL          22      /* Invalid argument */
#define ENFILE          23      /* File table overflow */
#define EMFILE          24      /* Too many open files */
#define ENOTTY          25      /* Not a typewriter */
#define ETXTBSY         26      /* Text file busy */
#define EFBIG           27      /* File too large */
#define ENOSPC          28      /* No space left on device */
#define ESPIPE          29      /* Illegal seek */
#define EROFS           30      /* Read-only file system */
#define EMLINK          31      /* Too many links */
#define EPIPE           32      /* Broken pipe */
#define EDOM            33      /* Math argument out of domain of func */
#define ERANGE          34      /* Math result not representable */

(include/linux/errno.h)
#define ERESTARTSYS             512
#define ERESTARTNOINTR          513
#define ERESTARTNOHAND          514     /* restart if no handler.. */
#define ENOIOCTLCMD             515     /* No ioctl command */
#define ERESTART_RESTARTBLOCK   516     /* restart by calling sys_restart_syscall */
#define EPROBE_DEFER            517     /* Driver requests probe retry */
#define EOPENSTALE              518     /* open found a stale dentry */
On return from an interrupted system call, the user-mode API always returns the
EINTR error code, irrespective of the specific error code returned by the
underlying kernel service routine. The remaining error codes are used by the
signal-delivery routines of the kernel to determine whether interrupted system
calls can be restarted on return from the signal handler.
The following table shows the error codes for when system call execution gets
interrupted and the effect it has for various signal dispositions:
This is what they mean:
No Restart: The system call will not be restarted. The process will resume
execution in user mode from the instruction that follows the system call (int
$0x80 or sysenter).
Auto Restart: The kernel forces the user process to re-initiate the system
call operation by loading the corresponding syscall identifier into eax and
executing the syscall instruction (int $0x80 or sysenter).
Explicit Restart: The system call is restarted only if the process has
enabled the SA_RESTART flag while setting up the handler (through sigaction)
for the interrupting signal.
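The explicit restart case can be sketched from user space with a read() on an empty pipe that is interrupted by a timer signal. In this illustration, feed_pipe() and interrupted_read() are hypothetical names, and the 100 ms timer delay is an assumption chosen so that read() is already blocked when SIGALRM arrives. The handler writes one byte into the pipe so that a restarted read() has something to return.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t wake_fd = -1;

/* Handler: make data available so that a restarted read() can complete. */
static void feed_pipe(int signo)
{
    char byte = 'x';

    (void)signo;
    if (write(wake_fd, &byte, 1) < 0) {
        /* nothing useful can be done from a handler */
    }
}

/*
 * Block in read() on an empty pipe until SIGALRM interrupts it.
 * Returns read()'s result (-2 on setup failure); errno is left as
 * set by read().
 */
long interrupted_read(int restart)
{
    struct sigaction sa, old;
    struct itimerval shot, off;
    int fds[2], saved;
    char buf;
    long n;

    if (pipe(fds) < 0)
        return -2;
    wake_fd = fds[1];

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = feed_pipe;
    sa.sa_flags = restart ? SA_RESTART : 0;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGALRM, &sa, &old);

    memset(&shot, 0, sizeof(shot));
    memset(&off, 0, sizeof(off));
    shot.it_value.tv_usec = 100000;     /* fire once after 100 ms */
    setitimer(ITIMER_REAL, &shot, NULL);

    n = read(fds[0], &buf, 1);          /* blocks until the signal arrives */
    saved = errno;

    setitimer(ITIMER_REAL, &off, NULL); /* disarm and clean up */
    sigaction(SIGALRM, &old, NULL);
    close(fds[0]);
    close(fds[1]);
    errno = saved;
    return n;
}
```

Without SA_RESTART, read() is expected to fail with EINTR even though the handler has made a byte available; with SA_RESTART, the call is re-issued after the handler returns and picks up that byte.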
Summary
Signals, though a rudimentary form of communication engaged by processes and
kernel services, provide an easy and effective way to get asynchronous responses
from a running process on occurrence of various events. By understanding all
core aspects of signal usage, their representation, data structures, and kernel
routines for signal generation and delivery, we are now more kernel aware and
also better prepared to look at more sophisticated means of communication
between processes in a later part of this book. After having spent the first
three chapters on processes and their related aspects, we shall now delve into
other subsystems of the kernel to broaden our view. In the next chapter, we
will build our understanding of one of the core aspects of the kernel, the
memory subsystem.
Throughout the next chapter, we will go step by step through many critical
aspects of memory management such as memory initialization, paging and
protection, and kernel memory allocation algorithms, among others.
Memory Management and Allocators
The efficiency of memory management broadly sets the efficiency of the whole
kernel. Casually managed memory systems can seriously impact the performance of
other subsystems, making memory a critical component of the kernel. This
subsystem sets all processes and kernel services in motion by virtualizing
physical memory and managing all dynamic allocation requests initiated by them.
The memory subsystem also handles a wide spectrum of operations in sustaining
operational efficiency and optimizing resources. The operations are both
architecture specific and architecture independent, which requires the overall
design and implementation to be well balanced and tunable. We will closely look
at the following aspects in this chapter in our effort to comprehend this
colossal subsystem:
Physical memory representation
Concepts of nodes and zones
Page allocator
Buddy system
Kmalloc allocations
Slab caches
Vmalloc allocations
Contiguous memory allocations
Initializationoperations
Inmostarchitectures,onreset,processorisinitializedinnormalorphysical
addressmode(alsocalledrealmodeinx86)andbeginsexecutingtheplatform's
firmwareinstructionsfoundattheresetvector.Thesefirmwareinstructions
(whichcanbesinglebinaryormulti-stagebinary)areprogrammedtocarryout
variousoperations,whichincludeinitializationofthememorycontroller,
calibrationofphysicalRAM,andloadingthebinarykernelimageintoaspecific
regionofphysicalmemory,amongothers.
When in real mode, processors do not support virtual addressing, and Linux,
which is designed and implemented for systems with protected mode, requires
virtual addressing to enable process protection and isolation, a crucial
abstraction provided by the kernel (recall from Chapter 1, Comprehending
Processes, Address Space, and Threads). This mandates that the processor be
switched into protected mode, with virtual address support turned on, before
the kernel kicks in and begins its boot operations and initialization of
subsystems. Switching to protected mode requires the MMU chipset to be
initialized by setting up appropriate core data structures, in the process
enabling paging. These operations are architecture specific and are implemented
in the arch branch of the kernel source tree. During kernel build, these sources
are compiled and linked as a header to the protected mode kernel image; this
header is referred to as the kernel bootstrap or real mode kernel.
Following is the main() routine of the x86 architecture's bootstrap; this
function is executed in real mode and is responsible for allocating appropriate
resources before stepping into protected mode by invoking
go_to_protected_mode():
/* arch/x86/boot/main.c */
void main(void)
{
    /* First, copy the boot header into the "zeropage" */
    copy_boot_params();

    /* Initialize the early-boot console */
    console_init();
    if (cmdline_find_option_bool("debug"))
        puts("early console in setup code\n");

    /* End of heap check */
    init_heap();

    /* Make sure we have all the proper CPU support */
    if (validate_cpu()) {
        puts("Unable to boot - please use a kernel appropriate "
             "for your CPU.\n");
        die();
    }

    /* Tell the BIOS what CPU mode we intend to run in. */
    set_bios_mode();

    /* Detect memory layout */
    detect_memory();

    /* Set keyboard repeat rate (why?) and query the lock flags */
    keyboard_init();

    /* Query Intel SpeedStep (IST) information */
    query_ist();

    /* Query APM information */
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
    query_apm_bios();
#endif

    /* Query EDD information */
#if defined(CONFIG_EDD) || defined(CONFIG_EDD_MODULE)
    query_edd();
#endif

    /* Set the video mode */
    set_video();

    /* Do the last things and invoke protected mode */
    go_to_protected_mode();
}
Real mode kernel routines that are invoked for setting up the MMU and handling
the transition into protected mode are architecture specific (we will not be
touching on those routines here). Irrespective of the architecture-specific code
engaged, the primary objective is to enable support for virtual addressing by
turning on paging. With paging enabled, the system begins to perceive physical
memory (RAM) as an array of blocks of fixed size, called page frames. The size
of a page frame is configured by programming the paging unit of the MMU
appropriately; most MMUs support 4k, 8k, 16k, 64k up to 4 MB options for frame
size configuration. However, the Linux kernel's default build configuration for
most architectures chooses 4k as its standard page frame size.
Page descriptor
Page frames are the smallest possible allocation units of memory, and the
kernel needs to utilize them for all its memory needs. Some page frames would
be required for mapping physical memory to virtual address spaces of user mode
processes, some for kernel code and its data structures, and some for processing
dynamic allocation requests raised by a process or a kernel service. For
efficient management of such operations, the kernel needs to distinguish page
frames currently in use from those which are free and available. This purpose is
achieved through an architecture-independent data structure called struct page,
which is defined to hold all metadata pertaining to a page frame, including its
current state. An instance of struct page is allocated for each physical page
frame found, and the kernel has to maintain a list of page instances in main
memory all the time.
The page structure is one of the most heavily used data structures of the
kernel, and is referred to from various kernel code paths. This structure is
populated with diverse elements, whose relevance is entirely based on the state
of the physical frame. For instance, specific members of the page structure
specify whether the corresponding physical page is mapped to the virtual address
space of a process or a group of processes. Such fields are not considered valid
when the physical page has been reserved for dynamic allocations. To ensure
that a page instance in memory is allocated only with relevant fields, unions
are heavily used to populate member fields. This is a prudent choice, since it
enables cramming more information into the page structure without increasing
its size in memory:
/* include/linux/mm_types.h */
/* The objects in struct page are organized in double word blocks in
 * order to allows us to use atomic double word operations on portions
 * of struct page. That is currently only used by slub but the arrangement
 * allows the use of atomic double word operations on the flags/mapping
 * and lru list pointers also.
 */
struct page {
    /* First double word block */
    unsigned long flags;    /* Atomic flags, some possibly updated asynchronously */
    union {
        struct address_space *mapping;
        void *s_mem;                    /* slab first object */
        atomic_t compound_mapcount;     /* first tail page */
        /* page_deferred_list().next -- second tail page */
    };
    ....
    ....
};
Following is a brief description of important members of the page structure.
Note that a lot of the details here assume your familiarity with other aspects
of the memory subsystem which we discuss in further sections of this chapter,
such as memory allocators, page tables, and so forth. I recommend that new
readers skip this section and revisit it after getting acquainted with the
necessary prerequisites.
/* Macros to create function definitions for page flags */
#define TESTPAGEFLAG(uname, lname, policy) \
static __always_inline int Page##uname(struct page *page) \
    { return test_bit(PG_##lname, &policy(page, 0)->flags); }

#define SETPAGEFLAG(uname, lname, policy) \
static __always_inline void SetPage##uname(struct page *page) \
    { set_bit(PG_##lname, &policy(page, 1)->flags); }

#define CLEARPAGEFLAG(uname, lname, policy) \
static __always_inline void ClearPage##uname(struct page *page) \
    { clear_bit(PG_##lname, &policy(page, 1)->flags); }

#define __SETPAGEFLAG(uname, lname, policy) \
static __always_inline void __SetPage##uname(struct page *page) \
    { __set_bit(PG_##lname, &policy(page, 1)->flags); }

#define __CLEARPAGEFLAG(uname, lname, policy) \
static __always_inline void __ClearPage##uname(struct page *page) \
    { __clear_bit(PG_##lname, &policy(page, 1)->flags); }

#define TESTSETFLAG(uname, lname, policy) \
static __always_inline int TestSetPage##uname(struct page *page) \
    { return test_and_set_bit(PG_##lname, &policy(page, 1)->flags); }

#define TESTCLEARFLAG(uname, lname, policy) \
static __always_inline int TestClearPage##uname(struct page *page) \
    { return test_and_clear_bit(PG_##lname, &policy(page, 1)->flags); }

....
....
Mapping
Another important element of the page descriptor is a pointer *mapping of type
struct address_space. However, this is one of the tricky pointers which might
either refer to an instance of struct address_space, or to an instance of
struct anon_vma. Before we get into details of how this is achieved, let's
first understand the importance of those structures and the resources they
represent.
Filesystems engage free pages (from page cache) to cache data of recently
accessed disk files. This mechanism helps minimize disk I/O operations: when
file data in the cache is modified, the appropriate page is marked dirty by
setting the PG_dirty bit; all dirty pages are written to the corresponding disk
block by scheduling disk I/O at strategic intervals. struct address_space is an
abstraction that represents a set of pages engaged for a file cache. Free pages
of the page cache can also be mapped to a process or process group for dynamic
allocations; pages mapped for such allocations are referred to as anonymous
page mappings. An instance of struct anon_vma represents a memory block created
with anonymous pages that are mapped to the virtual address space (through a
VMA instance) of a process or processes.
The tricky dynamic initialization of the pointer with the address of either of
the data structures is achieved by bit manipulation. If the low bit of the
pointer *mapping is clear, then it is an indication that the page is mapped to
an inode and the pointer refers to struct address_space. If the low bit is set,
it is an indication of an anonymous mapping, which means the pointer refers to
an instance of struct anon_vma. This is made possible by ensuring that
address_space instances are allocated aligned to sizeof(long), which leaves the
least significant bit of a pointer to address_space unset (that is, set to 0).
Zones and nodes
The principal data structures that are elementary for the entire memory
management framework are zones and nodes. Let's familiarize ourselves with the
core concepts behind these data structures.
Memory zones
For efficient management of memory allocations, physical pages are organized
into groups called zones. Pages in each zone are utilized for specific needs
like DMA, high memory, and other regular allocation needs. An enum in the
kernel header mmzone.h declares zone constants:
/* include/linux/mmzone.h */
enum zone_type {
#ifdef CONFIG_ZONE_DMA
    ZONE_DMA,
#endif
#ifdef CONFIG_ZONE_DMA32
    ZONE_DMA32,
#endif
    ZONE_NORMAL,
#ifdef CONFIG_HIGHMEM
    ZONE_HIGHMEM,
#endif
    ZONE_MOVABLE,
#ifdef CONFIG_ZONE_DEVICE
    ZONE_DEVICE,
#endif
    __MAX_NR_ZONES
};
ZONE_DMA: Pages in this zone are reserved for devices which cannot initiate DMA
on all addressable memory. The size of this zone is architecture specific:
Architecture            Limit
parisc, ia64, sparc     <4G
s390                    <2G
ARM                     variable
alpha                   unlimited or <16MB
i386, x86-64            <16MB
ZONE_DMA32: This zone is used for supporting 32-bit devices which can perform
DMA on <4G of memory. This zone is only present on x86-64 platforms.
ZONE_NORMAL: All addressable memory is considered to be normal zone. DMA
operations can be initiated on these pages, provided DMA devices support all
addressable memory.
ZONE_HIGHMEM: This zone contains pages that are only accessible by the kernel
through explicit mapping into its address space; in other words, all physical
memory pages beyond the kernel segment fall into this zone. This zone exists
only for 32-bit platforms with a 3:1 virtual address split (3G for user mode and
1G address space for kernel); for instance, on i386, allowing the kernel to
address memory beyond approximately 896 MB will require setting up special
mappings (page table entries) for each page that the kernel needs to access.
ZONE_MOVABLE: Memory fragmentation is one of the challenges for modern
operating systems to handle, and Linux is no exception to this. Right from the
moment the kernel boots, throughout its runtime, pages are allocated and
deallocated for an array of tasks, resulting in small regions of memory with
physically contiguous pages. Considering Linux support for virtual addressing,
fragmentation might not be an obstacle for smooth execution of various
processes, since physically scattered memory can always be mapped to a
virtually contiguous address space through page tables. Yet, there are a few
scenarios, like DMA allocations and setting up caches for kernel data
structures, that have a stringent need for physically contiguous regions.
Over the years, kernel developers have been evolving numerous
anti-fragmentation techniques to alleviate fragmentation. The introduction of
ZONE_MOVABLE is one of those attempts. The core idea here is to track movable
pages in each zone and represent them under this pseudo zone, which helps
prevent fragmentation (we discuss more on this in the next section on the buddy
system).
The size of this zone is to be configured at boot time through one of the
kernel parameters, kernelcore; note that the value assigned specifies the
amount of memory considered non-movable, and the rest, movable. As a general
rule, the memory manager is configured to consider migration of pages from the
highest populated zone to ZONE_MOVABLE, which is probably going to be
ZONE_HIGHMEM for x86 32-bit machines and ZONE_DMA32 on x86_64.
ZONE_DEVICE: This zone has been carved out to support hotplug memories, like
large capacity persistent-memory arrays. Persistent memories are very similar
to DRAM in many ways; specifically, CPUs can directly address them at byte
level. However, characteristics such as persistence, performance (slower
writes), and size (usually measured in terabytes) separate them from normal
memory. For the kernel to support such memories with 4 KB page size, it would
need to enumerate billions of page structures, which would consume a
significant percentage of main memory or not fit at all. As a result, kernel
developers chose to consider persistent memory a device, rather than like
memory, which means that the kernel can fall back on appropriate drivers to
manage such memories.
void *devm_memremap_pages(struct device *dev, struct resource *res,
                          struct percpu_ref *ref, struct vmem_altmap *altmap);
The devm_memremap_pages() routine of the persistent memory driver maps a region
of persistent memory into the kernel's address space with relevant page
structures set up in persistent device memory. All pages under these mappings
are grouped under ZONE_DEVICE. Having a distinct zone to tag such pages allows
the memory manager to distinguish them from regular uniform memory pages.
Memory nodes
The Linux kernel has supported multi-processor machine architectures for a long
time now. The kernel implements various resources such as per-CPU data caches,
mutual exclusion locks, and atomic operation macros, which are used across
various SMP-aware subsystems, such as the process scheduler and device
management, among others. In particular, the role of the memory management
subsystem is crucial for the kernel to tick on such architectures, since it
needs to virtualize memory as viewed by each processor. Multi-processor machine
architectures are broadly categorized into two types based on each processor's
perception of, and access latency to, memory on the system.
Uniform Memory Access Architecture (UMA): These are multi-processor
architecture machines, where processors are joined through an interconnect and
share physical memory and I/O ports. They are named UMA systems because memory
access latency is uniform and fixed, irrespective of the processor from which
the access was initiated. Most symmetric multi-processor systems are UMA.
Non-Uniform Memory Access Architecture (NUMA): These are multi-processor
machines with a contrasting design to that of UMA. These systems are designed
with dedicated memory for each processor with fixed time access latencies.
However, processors can initiate access operations on the local memory of other
processors through appropriate interconnects, and such operations render
variable time access latencies.
Machines of this model are appropriately named NUMA due to the non-uniform
(non-contiguous) view of the system's memory for each processor.
To extend support for NUMA machines, the kernel views each non-uniform memory
partition (local memory) as a node. Each node is identified by a descriptor of
type pg_data_t, which refers to pages under that node as per the zoning policy
discussed earlier. Each zone is represented through an instance of struct zone.
UMA machines would contain one node descriptor under which the entire memory is
represented, and on NUMA machines, a list of node descriptors is enumerated,
each representing a contiguous memory node. The following diagram illustrates
the relationship between these data structures:
We shall follow on with node and zone descriptor data structure definitions.
Note that we do not intend to describe every element of these structures, as
they are related to various aspects of memory management which are out of the
scope of this chapter.
/* include/linux/mmzone.h */

typedef struct pglist_data {
    struct zone node_zones[MAX_NR_ZONES];
    struct zonelist node_zonelists[MAX_ZONELISTS];
    int nr_zones;

#ifdef CONFIG_FLAT_NODE_MEM_MAP /* means !SPARSEMEM */
    struct page *node_mem_map;
#ifdef CONFIG_PAGE_EXTENSION
    struct page_ext *node_page_ext;
#endif
#endif

#ifndef CONFIG_NO_BOOTMEM
    struct bootmem_data *bdata;
#endif
#ifdef CONFIG_MEMORY_HOTPLUG
    spinlock_t node_size_lock;
#endif
    unsigned long node_start_pfn;
    unsigned long node_present_pages; /* total number of physical pages */
    unsigned long node_spanned_pages;
    int node_id;
    wait_queue_head_t kswapd_wait;
    wait_queue_head_t pfmemalloc_wait;
    struct task_struct *kswapd;
    int kswapd_order;
    enum zone_type kswapd_classzone_idx;

#ifdef CONFIG_COMPACTION
    int kcompactd_max_order;
    enum zone_type kcompactd_classzone_idx;
    wait_queue_head_t kcompactd_wait;
    struct task_struct *kcompactd;
#endif
#ifdef CONFIG_NUMA_BALANCING
    spinlock_t numabalancing_migrate_lock;
    unsigned long numabalancing_migrate_next_window;
    unsigned long numabalancing_migrate_nr_pages;
#endif
    unsigned long totalreserve_pages;

#ifdef CONFIG_NUMA
    unsigned long min_unmapped_pages;
    unsigned long min_slab_pages;
#endif /* CONFIG_NUMA */

    ZONE_PADDING(_pad1_)
    spinlock_t lru_lock;

#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
    unsigned long first_deferred_pfn;
#endif /* CONFIG_DEFERRED_STRUCT_PAGE_INIT */

#ifdef CONFIG_TRANSPARENT_HUGEPAGE
    spinlock_t split_queue_lock;
    struct list_head split_queue;
    unsigned long split_queue_len;
#endif
    unsigned int inactive_ratio;
    unsigned long flags;

    ZONE_PADDING(_pad2_)
    struct per_cpu_nodestat __percpu *per_cpu_nodestats;
    atomic_long_t vm_stat[NR_VM_NODE_STAT_ITEMS];
} pg_data_t;
Depending on the type of machine and kernel configuration chosen, various
elements are compiled into this structure. We'll look at a few important
elements:
Field                 Description
node_zones            An array that holds zone instances for pages in this
                      node.
node_zonelists        An array that specifies preferred allocation order for
                      zones in the node.
nr_zones              Count of zones in the current node.
node_mem_map          Pointer to list of page descriptors in the current node.
bdata                 Pointer to boot memory descriptor (discussed in a later
                      section).
node_start_pfn        Holds frame number of the first physical page in this
                      node; this value would be zero for UMA systems.
node_present_pages    Total count of pages in the node.
node_spanned_pages    Total size of physical page range, including holes if
                      any.
node_id               Holds unique node identifier (nodes are numbered from
                      zero).
kswapd_wait           Wait queue of kswapd kernel thread.
kswapd                Pointer to task structure of kswapd kernel thread.
totalreserve_pages    Count of reserve pages not used for user space
                      allocations.
struct zone {
    /* Read-mostly fields */

    /* zone watermarks, access with *_wmark_pages(zone) macros */
    unsigned long watermark[NR_WMARK];

    unsigned long nr_reserved_highatomic;

    /*
     * We don't know if the memory that we're going to allocate will be
     * freeable or/and it will be released eventually, so to avoid totally
     * wasting several GB of ram we must reserve some of the lower zone
     * memory (otherwise we risk to run OOM on the lower zones despite
     * there being tons of freeable ram on the higher zones). This array is
     * recalculated at runtime if the sysctl_lowmem_reserve_ratio sysctl
     * changes.
     */
    long lowmem_reserve[MAX_NR_ZONES];

#ifdef CONFIG_NUMA
    int node;
#endif
    struct pglist_data *zone_pgdat;
    struct per_cpu_pageset __percpu *pageset;

#ifndef CONFIG_SPARSEMEM
    /*
     * Flags for a pageblock_nr_pages block. See pageblock-flags.h.
     * In SPARSEMEM, this map is stored in struct mem_section
     */
    unsigned long *pageblock_flags;
#endif /* CONFIG_SPARSEMEM */

    /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
    unsigned long zone_start_pfn;

    /*
     * spanned_pages is the total pages spanned by the zone, including
     * holes, which is calculated as:
     *     spanned_pages = zone_end_pfn - zone_start_pfn;
     *
     * present_pages is physical pages existing within the zone, which
     * is calculated as:
     *     present_pages = spanned_pages - absent_pages(pages in holes);
     *
     * managed_pages is present pages managed by the buddy system, which
     * is calculated as (reserved_pages includes pages allocated by the
     * bootmem allocator):
     *     managed_pages = present_pages - reserved_pages;
     *
     * So present_pages may be used by memory hotplug or memory power
     * management logic to figure out unmanaged pages by checking
     * (present_pages - managed_pages). And managed_pages should be used
     * by page allocator and vm scanner to calculate all kinds of watermarks
     * and thresholds.
     *
     * Locking rules:
     *
     * zone_start_pfn and spanned_pages are protected by span_seqlock.
     * It is a seqlock because it has to be read outside of zone->lock,
     * and it is done in the main allocator path. But, it is written
     * quite infrequently.
     *
     * The span_seqlock is declared along with zone->lock because it is
     * frequently read in proximity to zone->lock. It's good to
     * give them a chance of being in the same cacheline.
     *
     * Write access to present_pages at runtime should be protected by
     * mem_hotplug_begin/end(). Any reader who can't tolerant drift of
     * present_pages should get_online_mems() to get a stable value.
     *
     * Read access to managed_pages should be safe because it's unsigned
     * long. Write access to zone->managed_pages and totalram_pages are
     * protected by managed_page_count_lock at runtime. Idealy only
     * adjust_managed_page_count() should be used instead of directly
     * touching zone->managed_pages and totalram_pages.
     */
    unsigned long managed_pages;
    unsigned long spanned_pages;
    unsigned long present_pages;

    const char *name;    /* name of this zone */

#ifdef CONFIG_MEMORY_ISOLATION
    /*
     * Number of isolated pageblock. It is used to solve incorrect
     * freepage counting problem due to racy retrieving migratetype
     * of pageblock. Protected by zone->lock.
     */
    unsigned long nr_isolate_pageblock;
#endif

#ifdef CONFIG_MEMORY_HOTPLUG
    /* see spanned/present_pages for more description */
    seqlock_t span_seqlock;
#endif

    int initialized;

    /* Write-intensive fields used from the page allocator */
    ZONE_PADDING(_pad1_)

    /* free areas of different sizes */
    struct free_area free_area[MAX_ORDER];

    /* zone flags, see below */
    unsigned long flags;

    /* Primarily protects free_area */
    spinlock_t lock;

    /* Write-intensive fields used by compaction and vmstats. */
    ZONE_PADDING(_pad2_)

    /*
     * When free pages are below this point, additional steps are taken
     * when reading the number of free pages to avoid per-CPU counter
     * drift allowing watermarks to be breached
     */
    unsigned long percpu_drift_mark;

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
    /* pfn where compaction free scanner should start */
    unsigned long compact_cached_free_pfn;
    /* pfn where async and sync compaction migration scanner should start */
    unsigned long compact_cached_migrate_pfn[2];
#endif

#ifdef CONFIG_COMPACTION
    /*
     * On compaction failure, 1<<compact_defer_shift compactions
     * are skipped before trying again. The number attempted since
     * last failure is tracked with compact_considered.
     */
    unsigned int compact_considered;
    unsigned int compact_defer_shift;
    int compact_order_failed;
#endif

#if defined CONFIG_COMPACTION || defined CONFIG_CMA
    /* Set to true when the PG_migrate_skip bits should be cleared */
    bool compact_blockskip_flush;
#endif

    bool contiguous;

    ZONE_PADDING(_pad3_)
    /* Zone statistics */
    atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
} ____cacheline_internodealigned_in_smp;
Following is the summarized table of important fields, with short descriptions
for each of them:
Field                     Description
watermark                 An array of unsigned long with WMARK_MIN,
                          WMARK_LOW, and WMARK_HIGH offsets. Values in these
                          offsets impact swap operations carried out by the
                          kswapd kernel thread.
nr_reserved_highatomic    Holds count of reserved high-order atomic pages.
lowmem_reserve            Array that specifies count of pages for each zone
                          that are reserved for critical allocations.
zone_pgdat                Pointer to node descriptor for this zone.
pageset                   Pointer to per-CPU hot-and-cold page lists.
free_area                 An array of instances of type struct free_area,
                          each abstracting contiguous free pages made
                          available for the buddy allocator. More on the
                          buddy allocator in a later section.
flags                     Unsigned long variable used to store current status
                          of the zone.
zone_start_pfn            Index of first page frame in the zone.
vm_stat                   Statistical information of the zone.
Memory allocators
Having looked at how physical memory is organized and represented through core
data structures, we will now shift our attention to management of physical
memory for processing allocation and deallocation requests. Memory allocation
requests can be raised by various entities in the system, such as user-mode
processes, drivers, and filesystems. Depending on the type of entity and the
context from which allocation is being requested, allocations returned might
need to meet certain characteristics, such as page-aligned physically
contiguous large blocks or physically contiguous small blocks, hardware cache
aligned memory, or physically fragmented blocks that are mapped to a virtually
contiguous address space.
To efficiently manage physical memory, and cater to memory as per the chosen
priority and pattern, the kernel engages with a group of memory allocators.
Each allocator has a distinct set of interface routines, which are backed by
precisely designed algorithms optimized for a specific allocation pattern.
Page frame allocator
Also called the zoned page frame allocator, this serves as an interface for
physically contiguous allocations in multiples of page size. Allocation
operations are carried out by looking into appropriate zones for free pages.
Physical pages in each zone are managed by the buddy system, which serves as
the backend algorithm for the page frame allocator.
Kernel code can initiate memory allocation/deallocation operations on this
algorithm through interface inline functions and macros provided in the kernel
header include/linux/gfp.h:
static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order);
The first parameter gfp_mask serves as a means to specify attributes as per
which allocations are to be fulfilled; we will look into details of the
attribute flags in coming sections. The second parameter order is used to
specify the size of the allocation; the request is for 2^order contiguous
pages. On success, it returns the address of the first page structure, and NULL
on failure. For single page allocations an alternate macro is made available,
which again falls back on alloc_pages():
#define alloc_page(gfp_mask) alloc_pages(gfp_mask, 0)
Allocated page(s) are mapped onto contiguous kernel address space, through
appropriate page table entries (for paged address translation during access
operations). Addresses generated after page table mapping, for use in kernel
code, are referred to as linear addresses. Through another function interface,
page_address(), the caller code can retrieve the start linear address of the
allocated block.
Allocations can also be initiated through a set of wrapper routines and macros
to alloc_pages(), which marginally extend functionality and return the start
linear address for the allocated chunk, instead of a pointer to the page
structure. The following code snippet shows a list of wrapper functions and
macros:
/* allocates 2^order pages and returns start linear address */
unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
{
    struct page *page;

    /*
     * __get_free_pages() returns a 32-bit address, which cannot represent
     * a highmem page
     */
    VM_BUG_ON((gfp_mask & __GFP_HIGHMEM) != 0);

    page = alloc_pages(gfp_mask, order);
    if (!page)
        return 0;
    return (unsigned long) page_address(page);
}

/* Returns start linear address to zero initialized page */
unsigned long get_zeroed_page(gfp_t gfp_mask)
{
    return __get_free_pages(gfp_mask | __GFP_ZERO, 0);
}

/* Allocates a page */
#define __get_free_page(gfp_mask) \
    __get_free_pages((gfp_mask), 0)

/* Allocate page/pages from DMA zone */
#define __get_dma_pages(gfp_mask, order) \
    __get_free_pages((gfp_mask) | GFP_DMA, (order))
Following are the interfaces for releasing memory back to the system. We need
to invoke the one that matches the allocation routine; passing an incorrect
address will cause corruption:
void __free_pages(struct page *page, unsigned int order);
void free_pages(unsigned long addr, unsigned int order);
#define free_page(addr) free_pages((addr), 0)
Buddy system
While the page allocator serves as an interface for memory allocations (in
multiples of page size), the buddy system operates at the back end to
administer physical page management. This algorithm manages all physical pages
for each zone. It is optimized to accomplish allocations of large physically
contiguous blocks (pages) by minimizing external fragmentation. Let's explore
its operational details.
The zone descriptor structure contains an array of struct free_area, and the
size of the array is defined through a kernel macro MAX_ORDER whose default
value is 11:
struct zone {
    ...
    ...
    struct free_area free_area[MAX_ORDER];
    ...
    ...
};
Each offset contains an instance of the free_area structure. All free pages are
split into 11 (MAX_ORDER) lists, each containing a list of blocks of 2^order
pages, with order values in the range of 0 to 10 (that is, with 4 KB pages, the
2^2 list would contain 16 KB sized blocks, the 2^3 list 32 KB sized blocks, and
so on). This strategy ensures that each block is naturally aligned. Blocks in
each list are exactly double the size of blocks in the next lower list,
resulting in faster allocation and deallocation operations. It also provides
the allocator with the capability to handle contiguous allocations of up to
4 MB block size (the 2^10 list).
When an allocation request is made for a particular size, the buddy system
looks into the appropriate list for a free block, and returns its address, if
available. However, if it cannot find a free block, it moves to check in the
next higher-order list for a larger block; if one is available, it splits the
higher-order block into two equal parts called buddies, returns one to the
caller, and queues the second into a lower-order list. When both buddy blocks
become free at some future time, they are coalesced to create a larger block.
The algorithm can identify buddy blocks through their aligned addresses, which
makes it possible to coalesce them.
Let's consider an example to comprehend this better, assuming there were a
request to allocate an 8k block (through page allocator routines). The buddy
system looks for free blocks in the 8k list of the free_area array (index 1,
containing blocks of 2^1 pages), and returns the start linear address of the
block if available; however, if there are no free blocks in the 8k list, it
moves on to the next higher-order list, which is of 16k blocks (index 2 of the
free_area array), to find a free block. Let's further assume that there were no
free blocks in this list as well. It then moves ahead into the next
higher-order list of size 32k (index 3 in the free_area array) to find a free
block; if available, it splits the 32k block into two equal halves of 16k each
(buddies). The first 16k chunk is further split into two halves of 8k
(buddies), of which one is allocated for the caller and the other is put into
the 8k list. The second chunk of 16k is put into the 16k free list. When the
lower-order (8k) buddies become free at some future time, they are coalesced to
form a higher-order 16k block. When both 16k buddies become free, they are
again coalesced to arrive at a 32k block, which is put back into the free list.
When a request for allocation from a desired zone cannot be processed, the
buddy system uses a fallback mechanism to look for other zones and nodes.
The buddy system has a long history with extensive implementations across
various *nix operating systems with appropriate optimizations. As discussed
earlier, it helps faster memory allocation and deallocation, and it also
minimizes external fragmentation to some degree. With the advent of huge pages,
which provide much-needed performance benefits, it has become all the more
important to further efforts toward anti-fragmentation. To accomplish this, the
Linux kernel's implementation of the buddy system is equipped with
anti-fragmentation capability through page migration.
Page migration is a process of moving data of a virtual page from one physical
memory region to another. This mechanism helps create larger blocks with
contiguous pages. To realize this, pages are categorized into the following
types:
1. Unmovable pages: Physical pages which are pinned and reserved for a specific
allocation are considered unmovable. Pages pinned for the core kernel fall into
this category. These pages are non-reclaimable.
2. Reclaimable pages: Physical pages mapped to a dynamic allocation that can be
evicted to a backstore, and those which can be regenerated, are considered
reclaimable. Pages held for file caching, anonymous page mappings, and those
held by the kernel's slab caches fall into this category. Reclaim operations
are carried out in two modes: periodic and direct reclaim. The former is
achieved through a kthread called kswapd. When the system runs exceedingly
short of memory, the kernel enters into direct reclaim.
3. Movable pages: Physical pages that can be moved to different regions through
the page migration mechanism. Pages mapped to the virtual address space of a
user-mode process are considered movable, since all the VM subsystem needs to
do is copy data and change relevant page table entries. This works, considering
all access operations from the user-mode process are put through page table
translations.
The buddy system groups pages on the basis of movability into independent
lists, and uses them for appropriate allocations. This is achieved by
organizing each 2^n list in struct free_area as a group of autonomous lists
based on mobility of pages. Each free_area instance holds an array of lists of
size MIGRATE_TYPES. Each offset holds the list_head of a respective group of
pages:
struct free_area {
    struct list_head free_list[MIGRATE_TYPES];
    unsigned long nr_free;
};
nr_free is a counter that holds the total number of free pages for this
free_area (all migration lists put together). The following diagram depicts
free lists for each migration type:
The following enum defines page migration types:
enum {
    MIGRATE_UNMOVABLE,
    MIGRATE_MOVABLE,
    MIGRATE_RECLAIMABLE,
    MIGRATE_PCPTYPES,    /* the number of types on the pcp lists */
    MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
#ifdef CONFIG_CMA
    MIGRATE_CMA,
#endif
#ifdef CONFIG_MEMORY_ISOLATION
    MIGRATE_ISOLATE,     /* can't allocate from here */
#endif
    MIGRATE_TYPES
};
We have discussed the key migration types MIGRATE_MOVABLE, MIGRATE_UNMOVABLE,
and MIGRATE_RECLAIMABLE. MIGRATE_PCPTYPES marks the number of migration types
kept on the per-CPU page lists, which were introduced to improve system
performance; each zone maintains a list of cache-hot pages in a per-CPU page
cache. These pages are used to serve allocation requests raised by the local
CPU. The zone descriptor structure's pageset element points to pages in the
per-CPU cache:
/* include/linux/mmzone.h */
struct per_cpu_pages {
    int count;    /* number of pages in the list */
    int high;     /* high watermark, emptying needed */
    int batch;    /* chunk size for buddy add/remove */

    /* Lists of pages, one per migrate type stored on the pcp-lists */
    struct list_head lists[MIGRATE_PCPTYPES];
};

struct per_cpu_pageset {
    struct per_cpu_pages pcp;
#ifdef CONFIG_NUMA
    s8 expire;
#endif
#ifdef CONFIG_SMP
    s8 stat_threshold;
    s8 vm_stat_diff[NR_VM_ZONE_STAT_ITEMS];
#endif
};

struct zone {
    ...
    ...
    struct per_cpu_pageset __percpu *pageset;
    ...
    ...
};
struct per_cpu_pageset is an abstraction that represents unmovable,
reclaimable, and movable page lists. MIGRATE_PCPTYPES is a count of per-CPU
page lists sorted as per page mobility. MIGRATE_CMA is a list of pages for the
contiguous memory allocator, which we shall discuss in further sections.
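The role of the count, high, and batch fields can be illustrated with a small
userspace model. This is only a sketch under stated assumptions: pcp_model,
buddy_model, pcp_alloc_page(), and pcp_free_page() are hypothetical names, plain
counters stand in for the real page lists, and batch and high are assumed to be
at least 1.

```c
#include <assert.h>

/* Userspace model only: page counts stand in for the real page lists. */
struct pcp_model {
    int count;   /* pages currently held on the per-CPU list */
    int high;    /* high watermark: drain once count reaches it */
    int batch;   /* pages moved to or from the buddy system at a time */
};

struct buddy_model {
    long free_pages;   /* pages available in the zone's free lists */
};

/* Allocate one page: refill a batch from the buddy system when empty. */
static int pcp_alloc_page(struct pcp_model *pcp, struct buddy_model *buddy)
{
    if (pcp->count == 0) {
        int take = pcp->batch;

        if (take > buddy->free_pages)
            take = (int)buddy->free_pages;
        buddy->free_pages -= take;
        pcp->count += take;
    }
    if (pcp->count == 0)
        return -1;   /* nothing left anywhere */
    pcp->count--;
    return 0;
}

/* Free one page: hand a batch back once the high watermark is reached. */
static void pcp_free_page(struct pcp_model *pcp, struct buddy_model *buddy)
{
    pcp->count++;
    if (pcp->count >= pcp->high) {
        int give = pcp->batch < pcp->count ? pcp->batch : pcp->count;

        pcp->count -= give;
        buddy->free_pages += give;
    }
    /* Holds as long as batch >= 1 and high >= 1 (assumed by this sketch). */
    assert(pcp->count >= 0 && pcp->count < pcp->high);
}
```

Most allocations and frees in this model touch only the local counters; the
shared pool is visited once per batch, which is the point of the per-CPU cache.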
The buddy system is implemented to fall back on an alternate list to process an
allocation request when pages of the desired mobility are not available. The
following array defines the fallback order for various migration types; we will
not elaborate further, as it is self-explanatory:
static int fallbacks[MIGRATE_TYPES][4] = {
        [MIGRATE_UNMOVABLE]   = { MIGRATE_RECLAIMABLE, MIGRATE_MOVABLE,   MIGRATE_TYPES },
        [MIGRATE_RECLAIMABLE] = { MIGRATE_UNMOVABLE,   MIGRATE_MOVABLE,   MIGRATE_TYPES },
        [MIGRATE_MOVABLE]     = { MIGRATE_RECLAIMABLE, MIGRATE_UNMOVABLE, MIGRATE_TYPES },
#ifdef CONFIG_CMA
        [MIGRATE_CMA]         = { MIGRATE_TYPES }, /* Never used */
#endif
#ifdef CONFIG_MEMORY_ISOLATION
        [MIGRATE_ISOLATE]     = { MIGRATE_TYPES }, /* Never used */
#endif
};
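To see how such a table is walked, consider the following userspace sketch. It
is not the kernel's implementation: free_area_model, the MT_* constants, and
take_block() are hypothetical names, page counts stand in for the real lists,
and the sketch ignores the fact that the kernel may move whole blocks between
lists when it falls back.

```c
#include <assert.h>

/* Simplified stand-ins for the three core migrate types. */
enum { MT_UNMOVABLE, MT_MOVABLE, MT_RECLAIMABLE, MT_TYPES };

/* Model of one free_area: a page count per migrate type instead of lists. */
struct free_area_model {
    unsigned long nr_per_type[MT_TYPES];
    unsigned long nr_free;   /* sum over all migrate types */
};

/* Same ordering as the three core rows of the fallback table. */
static const int fallback_order[MT_TYPES][3] = {
    [MT_UNMOVABLE]   = { MT_RECLAIMABLE, MT_MOVABLE,   MT_TYPES },
    [MT_RECLAIMABLE] = { MT_UNMOVABLE,   MT_MOVABLE,   MT_TYPES },
    [MT_MOVABLE]     = { MT_RECLAIMABLE, MT_UNMOVABLE, MT_TYPES },
};

/* Take one block; returns the migrate type it came from, or -1 if none. */
static int take_block(struct free_area_model *area, int wanted)
{
    int type = wanted;
    int i = 0;

    /* Try the wanted list first, then each fallback until the terminator. */
    while (type != MT_TYPES && area->nr_per_type[type] == 0)
        type = fallback_order[wanted][i++];
    if (type == MT_TYPES)
        return -1;

    area->nr_per_type[type]--;
    area->nr_free--;
    /* nr_free always equals the sum of the per-type counts. */
    assert(area->nr_free == area->nr_per_type[MT_UNMOVABLE] +
                            area->nr_per_type[MT_MOVABLE] +
                            area->nr_per_type[MT_RECLAIMABLE]);
    return type;
}
```

An unmovable request against an area holding only movable and reclaimable
pages is served from the reclaimable list first, matching the first row of the
table.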
typedef unsigned __bitwise__ gfp_t;
Gfp flags are used to supply two vital attributes for the allocator
functions: the first is the mode of the allocation, which controls the
behavior of the allocator function, and the second is the source of the
allocation, which indicates the zone or list of zones from which
memory can be sourced. The kernel header gfp.h defines various flag
constants that are categorized into distinct groups, called zone
modifiers, mobility and placement flags, watermark modifiers,
reclaim modifiers, and action modifiers.
#define __GFP_DMA ((__force gfp_t)___GFP_DMA)
#define __GFP_HIGHMEM ((__force gfp_t)___GFP_HIGHMEM)
#define __GFP_DMA32 ((__force gfp_t)___GFP_DMA32)
#define __GFP_MOVABLE ((__force gfp_t)___GFP_MOVABLE) /* ZONE_MOVABLE allowed */
#define __GFP_RECLAIMABLE ((__force gfp_t)___GFP_RECLAIMABLE)
#define __GFP_WRITE ((__force gfp_t)___GFP_WRITE)
#define __GFP_HARDWALL ((__force gfp_t)___GFP_HARDWALL)
#define __GFP_THISNODE ((__force gfp_t)___GFP_THISNODE)
#define __GFP_ACCOUNT ((__force gfp_t)___GFP_ACCOUNT)
Following is a list of page mobility and placement flags:
__GFP_RECLAIMABLE: Most kernel subsystems are designed to
engage memory caches for caching frequently needed resources
such as data structures, memory blocks, persistent file data, and
so on. The memory manager maintains such caches and allows
them to dynamically expand on demand. However, such caches
cannot be allowed to expand boundlessly, or they will eventually
consume all memory. The memory manager handles this issue
through the shrinker interface, a mechanism by which the
memory manager can shrink a cache, and reclaim pages when
needed. Enabling this flag while allocating pages (for the cache)
is an indication to the shrinker that the page is reclaimable. This
flag is used by the slab allocator, which is discussed in a later
section.
__GFP_WRITE: When this flag is used, it indicates to the kernel that
the caller intends to dirty the page. The memory manager
allocates the appropriate page as per the fair-zone allocation
policy, which round-robins the allocation of such pages across
local zones of the node to avoid all the dirty pages being in one
zone.
__GFP_HARDWALL: This flag ensures that allocation is carried out
on the same node or nodes to which the caller is bound; in other
words, it enforces the CPUSET memory allocation policy.
__GFP_THISNODE: This flag forces the allocation to be satisfied
from the requested node with no fallbacks or placement policy
enforcements.
__GFP_ACCOUNT: This flag causes allocations to be accounted for
by the kmem control group.
#define __GFP_ATOMIC ((__force gfp_t)___GFP_ATOMIC)
#define __GFP_HIGH ((__force gfp_t)___GFP_HIGH)
#define __GFP_MEMALLOC ((__force gfp_t)___GFP_MEMALLOC)
#define __GFP_NOMEMALLOC ((__force gfp_t)___GFP_NOMEMALLOC)
Following is a list of watermark modifiers, which provide control over
emergency reserve pools of memory:
__GFP_ATOMIC: This flag indicates that the allocation is high priority
and the caller context cannot be put into wait.
__GFP_HIGH: This flag indicates that the caller is high priority and
granting the allocation request is necessary for the system to make
progress. Setting this flag will cause the allocator to access the
emergency pool.
__GFP_MEMALLOC: This flag allows access to all memory. This
should only be used when the caller guarantees the allocation will
allow more memory to be freed very shortly, for example, process
exiting or swapping.
__GFP_NOMEMALLOC: This flag is used to forbid access to all
reserved emergency pools.
#define __GFP_IO ((__force gfp_t)___GFP_IO)
#define __GFP_FS ((__force gfp_t)___GFP_FS)
#define __GFP_DIRECT_RECLAIM ((__force gfp_t)___GFP_DIRECT_RECLAIM) /* Caller can reclaim */
#define __GFP_KSWAPD_RECLAIM ((__force gfp_t)___GFP_KSWAPD_RECLAIM) /* kswapd can wake */
#define __GFP_RECLAIM ((__force gfp_t)(___GFP_DIRECT_RECLAIM|___GFP_KSWAPD_RECLAIM))
#define __GFP_REPEAT ((__force gfp_t)___GFP_REPEAT)
#define __GFP_NOFAIL ((__force gfp_t)___GFP_NOFAIL)
#define __GFP_NORETRY ((__force gfp_t)___GFP_NORETRY)
Following is a list of reclaim modifiers that can be passed as
arguments to allocation routines; each flag enables reclaim operations
on a specific region of memory:
__GFP_IO: This flag indicates that the allocator can start physical
I/O (swap) to reclaim memory.
__GFP_FS: This flag indicates that the allocator may call down to
the low-level FS for reclaim.
__GFP_DIRECT_RECLAIM: This flag indicates that the caller is
willing to enter direct reclaim. This might cause the caller to
block.
__GFP_KSWAPD_RECLAIM: This flag indicates that the allocator can
wake the kswapd kernel thread to initiate reclaim when the low
watermark is reached.
__GFP_RECLAIM: This flag is used to enable both direct and kswapd
reclaim.
__GFP_REPEAT: This flag tells the allocator to try hard to allocate the
memory, but the allocation attempt might still fail.
__GFP_NOFAIL: This flag forces the virtual memory manager to
retry until the allocation request succeeds. This might cause the
VM to trigger the OOM killer to reclaim memory.
__GFP_NORETRY: This flag will cause the allocator to return an
appropriate failure status when the request cannot be served.
#define __GFP_COLD ((__force gfp_t)___GFP_COLD)
#define __GFP_NOWARN ((__force gfp_t)___GFP_NOWARN)
#define __GFP_COMP ((__force gfp_t)___GFP_COMP)
#define __GFP_ZERO ((__force gfp_t)___GFP_ZERO)
#define __GFP_NOTRACK ((__force gfp_t)___GFP_NOTRACK)
#define __GFP_NOTRACK_FALSE_POSITIVE (__GFP_NOTRACK)
#define __GFP_OTHER_NODE ((__force gfp_t)___GFP_OTHER_NODE)
Following is a list of action modifier flags; these flags specify
additional attributes to be considered by the allocator routines while
processing a request:
__GFP_COLD: To enable quick access, a few pages in each zone are
cached into per-CPU caches; pages held in the cache are referred to
as hot, and uncached pages are referred to as cold. This flag
indicates that the allocator should serve memory requests through
cache-cold page(s).
__GFP_NOWARN: This flag causes the allocator to run in silent
mode, which results in warning and error conditions going
unreported.
__GFP_COMP: This flag is used to allocate a compound page with
appropriate metadata. A compound page is a group of two or
more physically contiguous pages, which are treated as a single
large page. Metadata makes a compound page distinct from other
physically contiguous pages. The first physical page of a
compound page is called the head page, with the PG_head flag set
in its page descriptor, and the rest of the pages are referred to as
tail pages.
__GFP_ZERO: This flag causes the allocator to return zero-filled
page(s).
__GFP_NOTRACK: kmemcheck is one of the in-kernel debuggers,
which is used to detect and warn about uninitialized memory access.
Nonetheless, such checks cause memory access operations to be
delayed. When performance is a criterion, the caller might want to
allocate memory that is not tracked by kmemcheck. This flag
causes the allocator to return such memory.
__GFP_NOTRACK_FALSE_POSITIVE: This flag is an alias of
__GFP_NOTRACK.
__GFP_OTHER_NODE: This flag is used for allocation of transparent
huge pages (THP).
#define GFP_ATOMIC (__GFP_HIGH|__GFP_ATOMIC|__GFP_KSWAPD_RECLAIM)
#define GFP_KERNEL (__GFP_RECLAIM | __GFP_IO | __GFP_FS)
#define GFP_KERNEL_ACCOUNT (GFP_KERNEL | __GFP_ACCOUNT)
#define GFP_NOWAIT (__GFP_KSWAPD_RECLAIM)
#define GFP_NOIO (__GFP_RECLAIM)
#define GFP_NOFS (__GFP_RECLAIM | __GFP_IO)
#define GFP_TEMPORARY (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_RECLAIMABLE)
#define GFP_USER (__GFP_RECLAIM | __GFP_IO | __GFP_FS | __GFP_HARDWALL)
#define GFP_DMA __GFP_DMA
#define GFP_DMA32 __GFP_DMA32
#define GFP_HIGHUSER (GFP_USER | __GFP_HIGHMEM)
#define GFP_HIGHUSER_MOVABLE (GFP_HIGHUSER | __GFP_MOVABLE)
#define GFP_TRANSHUGE_LIGHT ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
                              __GFP_NOMEMALLOC | __GFP_NOWARN) & ~__GFP_RECLAIM)
#define GFP_TRANSHUGE (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)
The following is the list of type flags:
GFP_ATOMIC: This flag is specified for non-blocking allocations
that the caller needs to succeed. It lets the allocator draw on
emergency reserves, although the request can still fail. This is
generally used while invoking the allocator from an atomic context.
GFP_KERNEL: This flag is used while allocating memory for kernel
use. These requests are processed from the normal zone. This flag
might cause the allocator to enter direct reclaim.
GFP_KERNEL_ACCOUNT: Same as GFP_KERNEL, with the addition that
the allocation is tracked by the kmem control group.
GFP_NOWAIT: This flag is used for kernel allocations that are non-
blocking.
GFP_NOIO: This flag allows the allocator to begin direct reclaim on
clean pages that do not require physical I/O (swap).
GFP_NOFS: This flag allows the allocator to begin direct reclaim
but prevents invocation of filesystem interfaces.
GFP_TEMPORARY: This flag is used while allocating pages for
kernel caches, which are reclaimable through the appropriate
shrinker interface. This flag sets the __GFP_RECLAIMABLE flag we
discussed earlier.
GFP_USER: This flag is used for user-space allocations. Memory
allocated is mapped to a user process and can also be accessed by
kernel services or hardware for DMA transfers from a device into a
buffer or vice versa.
GFP_DMA: This flag causes allocation from the lowest zone, called
ZONE_DMA. This flag is still supported for backward compatibility.
GFP_DMA32: This flag causes allocation to be processed from
ZONE_DMA32, which contains pages below 4 GB.
GFP_HIGHUSER: This flag is used for user-space allocations from
ZONE_HIGHMEM (relevant only on 32-bit platforms).
GFP_HIGHUSER_MOVABLE: This flag is similar to GFP_HIGHUSER,
with the addition that allocations are carried out from movable
pages, which enables page migration and reclaim.
GFP_TRANSHUGE_LIGHT: This causes the allocation of transparent
huge pages (THP), which are compound allocations. This
type flag sets __GFP_COMP, which we discussed earlier.
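Since type flags are plain bitwise combinations of the modifiers, their
behavior can be read off the mask. The following userspace sketch composes a
few of them and tests for blocking, much as the kernel helper
gfpflags_allow_blocking() tests __GFP_DIRECT_RECLAIM. The M_ prefixed names,
the bit positions, and the helper functions are assumptions made for
illustration; they are not the kernel's values.

```c
#include <assert.h>
#include <stdbool.h>

typedef unsigned int gfp_model_t;

/* Illustrative bit positions only; the kernel's real values differ. */
#define M_GFP_HIGH           (1u << 0)
#define M_GFP_IO             (1u << 1)
#define M_GFP_FS             (1u << 2)
#define M_GFP_ATOMIC_BIT     (1u << 3)
#define M_GFP_DIRECT_RECLAIM (1u << 4)
#define M_GFP_KSWAPD_RECLAIM (1u << 5)
#define M_GFP_RECLAIM        (M_GFP_DIRECT_RECLAIM | M_GFP_KSWAPD_RECLAIM)

/* Type flags composed the same way as in the listing above. */
#define M_GFP_ATOMIC (M_GFP_HIGH | M_GFP_ATOMIC_BIT | M_GFP_KSWAPD_RECLAIM)
#define M_GFP_KERNEL (M_GFP_RECLAIM | M_GFP_IO | M_GFP_FS)
#define M_GFP_NOWAIT (M_GFP_KSWAPD_RECLAIM)
#define M_GFP_NOIO   (M_GFP_RECLAIM)
#define M_GFP_NOFS   (M_GFP_RECLAIM | M_GFP_IO)

/* The caller may sleep only if it is willing to enter direct reclaim. */
static bool may_block(gfp_model_t mask)
{
    return (mask & M_GFP_DIRECT_RECLAIM) != 0;
}

static bool may_start_io(gfp_model_t mask)
{
    return (mask & M_GFP_IO) != 0;
}

static bool may_call_fs(gfp_model_t mask)
{
    return (mask & M_GFP_FS) != 0;
}

/* Pick a mask for the calling context, as a driver author would. */
static gfp_model_t mask_for_context(bool in_atomic_context)
{
    gfp_model_t mask = in_atomic_context ? M_GFP_ATOMIC : M_GFP_KERNEL;

    /* A mask chosen for atomic context must never allow blocking. */
    assert(!in_atomic_context || !may_block(mask));
    return mask;
}
```

The differences between GFP_KERNEL, GFP_NOFS, and GFP_NOIO reduce to which of
the I/O and filesystem bits are left out, while GFP_ATOMIC and GFP_NOWAIT omit
the direct reclaim bit altogether.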
Slab allocator
As discussed in earlier sections, the page allocator (in coordination with the
buddy system) does an efficient job of handling memory allocation requests in
multiples of page size. However, most allocation requests initiated by kernel
code for its internal use are for smaller blocks (usually less than a page);
engaging the page allocator for such allocations results in internal
fragmentation, causing wastage of memory. The slab allocator is implemented
precisely to address this; it is built on top of the buddy system and is used
to allocate small memory blocks, to hold structure objects or data used by
kernel services.
Design of the slab allocator is based on the idea of an object cache. The
concept of an object cache is quite simple: it involves reserving a set of free
page frames, dividing and organizing them into independent free lists (with
each list containing a few free pages) called slab caches, and using each list
for allocation of a pool of objects or memory blocks of a fixed size, called a
unit. This way, each list is assigned a unique unit size, and contains a pool
of objects or memory blocks of that size. When an allocation request arrives
for a block of memory of a given size, the allocator algorithm selects an
appropriate slab cache whose unit size is the best fit for the requested size,
and returns the address of a free block.
However, at a low level, there is a fair bit of complexity involved in terms of
initialization and management of slab caches. The algorithm needs to consider
various issues such as object tracking, dynamic expansion, and safe reclaim
through the shrinker interface. Addressing all these issues and achieving a
proper balance between enhanced performance and optimum memory footprint is
quite a challenge. We shall explore more on these challenges in subsequent
sections, but for now we will continue our discussion with allocator function
interfaces.
Kmalloc caches
The slab allocator maintains a set of generic slab caches to cache memory
blocks of fixed unit sizes, ranging from 8 bytes to 8 KB (powers of two, plus
96 and 192 bytes). It maintains two sets of slab caches for each unit size, one
to maintain a pool of memory blocks allocated from ZONE_NORMAL pages and
another from ZONE_DMA pages. These caches are global and shared by all kernel
code. Users can track the status of these caches through a special file,
/proc/slabinfo. Kernel services can allocate and release memory blocks from
these caches through the kmalloc family of routines. They are referred to as
kmalloc caches:
# cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
dma-kmalloc-8192       0      0   8192    4    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-4096       0      0   4096    8    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-2048       0      0   2048   16    8 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-1024       0      0   1024   16    4 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-512        0      0    512   16    2 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-256        0      0    256   16    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-128        0      0    128   32    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-64         0      0     64   64    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-32         0      0     32  128    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-16         0      0     16  256    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-8          0      0      8  512    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-192        0      0    192   21    1 : tunables    0    0    0 : slabdata      0      0      0
dma-kmalloc-96         0      0     96   42    1 : tunables    0    0    0 : slabdata      0      0      0
kmalloc-8192         156    156   8192    4    8 : tunables    0    0    0 : slabdata     39     39      0
kmalloc-4096         325    352   4096    8    8 : tunables    0    0    0 : slabdata     44     44      0
kmalloc-2048        1105   1184   2048   16    8 : tunables    0    0    0 : slabdata     74     74      0
kmalloc-1024        2374   2448   1024   16    4 : tunables    0    0    0 : slabdata    153    153      0
kmalloc-512         1445   1520    512   16    2 : tunables    0    0    0 : slabdata     95     95      0
kmalloc-256         9988  10400    256   16    1 : tunables    0    0    0 : slabdata    650    650      0
kmalloc-192         3561   4053    192   21    1 : tunables    0    0    0 : slabdata    193    193      0
kmalloc-128         3588   5728    128   32    1 : tunables    0    0    0 : slabdata    179    179      0
kmalloc-96          3402   3402     96   42    1 : tunables    0    0    0 : slabdata     81     81      0
kmalloc-64         42672  45184     64   64    1 : tunables    0    0    0 : slabdata    706    706      0
kmalloc-32         15095  16000     32  128    1 : tunables    0    0    0 : slabdata    125    125      0
kmalloc-16          6400   6400     16  256    1 : tunables    0    0    0 : slabdata     25     25      0
kmalloc-8           6144   6144      8  512    1 : tunables    0    0    0 : slabdata     12     12      0
kmalloc-96 and kmalloc-192 are the only caches whose unit sizes are not powers
of two; they reduce the space wasted by requests that fall just above 64 and
128 bytes. For allocations above 8 KB (large blocks), the slab allocator falls
back on the buddy system.
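The best-fit selection of a kmalloc cache can be sketched in userspace as
follows. This is an illustration only: the table simply repeats the unit sizes
from the listing above, pick_unit_size() and wasted_bytes() are hypothetical
names, requests are assumed to be at least one byte, and the kernel uses a
faster index computation rather than a linear scan.

```c
#include <assert.h>
#include <stddef.h>

/* Unit sizes taken from the /proc/slabinfo listing. */
static const size_t kmalloc_unit_sizes[] = {
    8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
};

/*
 * Return the unit size of the best-fitting cache, or 0 when the request
 * is larger than the biggest cache and would go to the page allocator.
 */
static size_t pick_unit_size(size_t request)
{
    size_t n = sizeof(kmalloc_unit_sizes) / sizeof(kmalloc_unit_sizes[0]);
    size_t i;

    for (i = 0; i < n; i++) {
        if (request <= kmalloc_unit_sizes[i])
            return kmalloc_unit_sizes[i];
    }
    return 0;
}

/* Bytes lost to internal fragmentation for a given request size. */
static size_t wasted_bytes(size_t request)
{
    size_t unit = pick_unit_size(request);

    assert(unit == 0 || unit >= request);
    return unit ? unit - request : 0;
}
```

A 65-byte request, for instance, lands in the 96-byte cache and wastes 31
bytes; without that cache it would occupy a 128-byte block.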
Following are the kmalloc family of allocator routines; all of these need
appropriate GFP flags:
/**
 * kmalloc - allocate memory.
 * @size: bytes of memory required.
 * @flags: the type of memory to allocate.
 */
void *kmalloc(size_t size, gfp_t flags)
/**
 * kzalloc - allocate memory. The memory is set to zero.
 * @size: bytes of memory required.
 * @flags: the type of memory to allocate.
 */
inline void *kzalloc(size_t size, gfp_t flags)
/**
 * kmalloc_array - allocate memory for an array.
 * @n: number of elements.
 * @size: element size.
 * @flags: the type of memory to allocate (see kmalloc).
 */
inline void *kmalloc_array(size_t n, size_t size, gfp_t flags)
/**
 * kcalloc - allocate memory for an array. The memory is set to zero.
 * @n: number of elements.
 * @size: element size.
 * @flags: the type of memory to allocate (see kmalloc).
 */
inline void *kcalloc(size_t n, size_t size, gfp_t flags)
/**
 * krealloc - reallocate memory. The contents will remain unchanged.
 * @p: object to reallocate memory for.
 * @new_size: bytes of memory are required.
 * @flags: the type of memory to allocate.
 *
 * The contents of the object pointed to are preserved up to the
 * lesser of the new and old sizes. If @p is %NULL, krealloc()
 * behaves exactly like kmalloc(). If @new_size is 0 and @p is not a
 * %NULL pointer, the object pointed to is freed
 */
void *krealloc(const void *p, size_t new_size, gfp_t flags)
/**
 * kmalloc_node - allocate memory from a particular memory node.
 * @size: bytes of memory are required.
 * @flags: the type of memory to allocate.
 * @node: memory node from which to allocate
 */
void *kmalloc_node(size_t size, gfp_t flags, int node)
/**
 * kzalloc_node - allocate zeroed memory from a particular memory node.
 * @size: how many bytes of memory are required.
 * @flags: the type of memory to allocate (see kmalloc).
 * @node: memory node from which to allocate
 */
void *kzalloc_node(size_t size, gfp_t flags, int node)
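The array variants are worth preferring over a hand-written multiplication
because kmalloc_array() refuses a request whose n * size product would
overflow. The following userspace sketch shows that check; array_alloc(),
array_zalloc(), and array_demo() are hypothetical stand-ins built on malloc(),
not kernel functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for kmalloc_array(): reject n * size if it would overflow. */
static void *array_alloc(size_t n, size_t size)
{
    if (size != 0 && n > SIZE_MAX / size)
        return NULL;
    return malloc(n * size);
}

/* Stand-in for kcalloc(): same check, then zero the whole block. */
static void *array_zalloc(size_t n, size_t size)
{
    void *p = array_alloc(n, size);

    if (p)
        memset(p, 0, n * size);
    return p;
}

/* Small demonstration: allocate four zeroed ints and release them. */
static int array_demo(void)
{
    int *v = array_zalloc(4, sizeof(*v));

    if (!v)
        return -1;
    assert(v[0] == 0 && v[3] == 0);
    free(v);
    return 0;
}
```

Without the check, a product that wraps around would yield a block far smaller
than the caller expects, and later writes would run past its end.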
The following routines return the allocated block to the free pool. Callers
need to ensure that the address passed as an argument is that of a valid
allocated block:
/**
 * kfree - free previously allocated memory
 * @objp: pointer returned by kmalloc.
 *
 * If @objp is NULL, no operation is performed.
 *
 * Don't free memory not originally allocated by kmalloc()
 * or you will run into trouble.
 */
void kfree(const void *objp)
/**
 * kzfree - like kfree but zero memory
 * @p: object to free memory of
 *
 * The memory of the object @p points to is zeroed before freed.
 * If @p is %NULL, kzfree() does nothing.
 *
 * Note: this function zeroes the whole allocated buffer which can be a good
 * deal bigger than the requested buffer size passed to kmalloc(). So be
 * careful when using this function in performance sensitive code.
 */
void kzfree(const void *p)
Object caches
The slab allocator provides function interfaces for setting up slab caches,
which can be owned by a kernel service or a subsystem. Such caches are
considered private since they are local to kernel services (or a kernel
subsystem) like device drivers, filesystems, the process scheduler, and so on.
This facility is used by most kernel subsystems to set up object caches and
pool intermittently needed data structures. Most data structures we've
encountered so far (since Chapter 1, Comprehending Processes, Address Space,
and Threads), including the process descriptor, signal descriptor, page
descriptor, and so on, are maintained in such object pools. The pseudo file
/proc/slabinfo shows the status of object caches:
# cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
sigqueue             100    100    160   25    1 : tunables    0    0    0 : slabdata      4      4      0
bdev_cache            76     76    832   19    4 : tunables    0    0    0 : slabdata      4      4      0
kernfs_node_cache  28594  28594    120   34    1 : tunables    0    0    0 : slabdata    841    841      0
mnt_cache            489    588    384   21    2 : tunables    0    0    0 : slabdata     28     28      0
inode_cache        15932  15932    568   28    4 : tunables    0    0    0 : slabdata    569    569      0
dentry             89541  89817    192   21    1 : tunables    0    0    0 : slabdata   4277   4277      0
iint_cache             0      0     72   56    1 : tunables    0    0    0 : slabdata      0      0      0
buffer_head        53079  53430    104   39    1 : tunables    0    0    0 : slabdata   1370   1370      0
vm_area_struct     41287  42400    200   20    1 : tunables    0    0    0 : slabdata   2120   2120      0
files_cache          207    207    704   23    4 : tunables    0    0    0 : slabdata      9      9      0
signal_cache         420    420   1088   30    8 : tunables    0    0    0 : slabdata     14     14      0
sighand_cache        289    315   2112   15    8 : tunables    0    0    0 : slabdata     21     21      0
task_struct          750    801   3584    9    8 : tunables    0    0    0 : slabdata     89     89      0
The kmem_cache_create() routine sets up a new cache as per the parameters
passed. On success, it returns the address of the cache descriptor structure of
type kmem_cache:
/*
 * kmem_cache_create - Create a cache.
 * @name: A string which is used in /proc/slabinfo to identify this cache.
 * @size: The size of objects to be created in this cache.
 * @align: The required alignment for the objects.
 * @flags: SLAB flags
 * @ctor: A constructor for the objects.
 *
 * Returns a ptr to the cache on success, NULL on failure.
 * Cannot be called within a interrupt, but can be interrupted.
 * The @ctor is run when new pages are allocated by the cache.
 *
 */
struct kmem_cache *kmem_cache_create(const char *name, size_t size, size_t align,
                                     unsigned long flags, void (*ctor)(void *))
The cache is created by allocating free page frames (from the buddy system),
and data objects of the size specified (second argument) are populated. Though
each cache starts by hosting a fixed number of data objects during creation, it
can grow dynamically when required to accommodate more data objects. Data
structures can be complicated (we have encountered a few), and can contain
varied elements such as list headers, sub-objects, arrays, atomic counters,
bit-fields, and so on. Setting up each object might require all its fields to
be initialized to the default state; this can be achieved through an
initializer routine assigned to the *ctor function pointer (last argument). The
initializer is called for each new object allocated, both during cache creation
and when the cache grows to add more free objects. However, for simple objects,
a cache can be created without an initializer.
Following is a sample code snippet that shows the usage of kmem_cache_create():
/* net/core/skbuff.c */
struct kmem_cache *skbuff_head_cache;
skbuff_head_cache = kmem_cache_create("skbuff_head_cache", sizeof(struct sk_buff), 0,
                                      SLAB_HWCACHE_ALIGN|SLAB_PANIC, NULL);
Flags are used to enable debug checks, and to enhance the performance of access
operations on the cache by aligning objects with the hardware cache. The
following flag constants are supported:
SLAB_CONSISTENCY_CHECKS /* DEBUG: Perform (expensive) checks on alloc/free */
SLAB_RED_ZONE           /* DEBUG: Red zone objs in a cache */
SLAB_POISON             /* DEBUG: Poison objects */
SLAB_HWCACHE_ALIGN      /* Align objs on cache lines */
SLAB_CACHE_DMA          /* Use GFP_DMA memory */
SLAB_STORE_USER         /* DEBUG: Store the last owner for bug hunting */
SLAB_PANIC              /* Panic if kmem_cache_create() fails */
Subsequently, objects can be allocated and released through relevant functions.
Upon release, objects are put back into the free list of the cache, making them
available for reuse; this results in a possible performance boost, particularly
when objects are cache hot:
/**
 * kmem_cache_alloc - Allocate an object
 * @cachep: The cache to allocate from.
 * @flags: GFP mask.
 *
 * Allocate an object from this cache. The flags are only relevant
 * if the cache has no available objects.
 */
void *kmem_cache_alloc(struct kmem_cache *cachep, gfp_t flags);
/**
 * kmem_cache_alloc_node - Allocate an object on the specified node
 * @cachep: The cache to allocate from.
 * @flags: GFP mask.
 * @nodeid: node number of the target node.
 *
 * Identical to kmem_cache_alloc but it will allocate memory on the given
 * node, which can improve the performance for cpu bound structures.
 *
 * Fallback to other node is possible if __GFP_THISNODE is not set.
 */
void *kmem_cache_alloc_node(struct kmem_cache *cachep, gfp_t flags, int nodeid);
/**
 * kmem_cache_free - Deallocate an object
 * @cachep: The cache the allocation was from.
 * @objp: The previously allocated object.
 *
 * Free an object which was previously allocated from this
 * cache.
 */
void kmem_cache_free(struct kmem_cache *cachep, void *objp);
kmem caches can be destroyed when all hosted data objects are free (not in
use), by calling kmem_cache_destroy().
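The life cycle described above (create, allocate, free, destroy, with a
constructor run on new objects) can be modelled in a few lines of userspace C.
This is a sketch under stated assumptions, not the kernel's implementation:
toy_cache, toy_obj, and the toy_cache_*() functions are hypothetical, malloc()
stands in for the buddy system, the slab size and limits are arbitrary, and the
cache grows lazily on first allocation rather than at creation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define OBJS_PER_SLAB 4   /* arbitrary for this sketch */
#define MAX_SLABS     8   /* arbitrary cap so the free stack has a fixed size */

/* Hypothetical userspace model of an object cache. */
struct toy_cache {
    size_t obj_size;
    void (*ctor)(void *);                        /* optional initializer */
    void *slabs[MAX_SLABS];                      /* one block per slab */
    void *free_objs[MAX_SLABS * OBJS_PER_SLAB];  /* stack of free objects */
    int nr_slabs;
    int nr_free;
};

/* Sample object type and constructor used to exercise the cache. */
struct toy_obj {
    int refcount;
    int id;
};

static void toy_obj_ctor(void *p)
{
    struct toy_obj *obj = p;

    obj->refcount = 1;
    obj->id = -1;
}

static void toy_cache_init(struct toy_cache *c, size_t obj_size,
                           void (*ctor)(void *))
{
    c->obj_size = obj_size;
    c->ctor = ctor;
    c->nr_slabs = 0;
    c->nr_free = 0;
}

/* Grow by one slab; the constructor runs once per new object, only here. */
static int toy_cache_grow(struct toy_cache *c)
{
    char *slab;
    int i;

    if (c->nr_slabs == MAX_SLABS)
        return -1;
    slab = malloc(c->obj_size * OBJS_PER_SLAB);
    if (!slab)
        return -1;
    c->slabs[c->nr_slabs++] = slab;
    for (i = 0; i < OBJS_PER_SLAB; i++) {
        void *obj = slab + (size_t)i * c->obj_size;

        if (c->ctor)
            c->ctor(obj);
        c->free_objs[c->nr_free++] = obj;
    }
    return 0;
}

static void *toy_cache_alloc(struct toy_cache *c)
{
    if (c->nr_free == 0 && toy_cache_grow(c) != 0)
        return NULL;
    return c->free_objs[--c->nr_free];
}

/* Released objects go back on the free stack and are reused first. */
static void toy_cache_free(struct toy_cache *c, void *obj)
{
    assert(c->nr_free < c->nr_slabs * OBJS_PER_SLAB);
    c->free_objs[c->nr_free++] = obj;
}

/* Destroy succeeds only when every hosted object is free. */
static int toy_cache_destroy(struct toy_cache *c)
{
    int i;

    if (c->nr_free != c->nr_slabs * OBJS_PER_SLAB)
        return -1;
    for (i = 0; i < c->nr_slabs; i++)
        free(c->slabs[i]);
    c->nr_slabs = 0;
    c->nr_free = 0;
    return 0;
}
```

Because the constructor runs only when a slab is added, a caller of this model
is expected to return objects in their constructed state, which is what lets a
reused object skip initialization.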
Cache management
All slab caches are managed internally by the slab core, which is a low-level
algorithm. It defines various control structures that describe the physical
layout for each cache list, and implements core cache-management operations
which are invoked by interface routines. The slab allocator was originally
implemented in Solaris 2.4 kernels, based on a paper by Bonwick, and has since
been used by most other *nix kernels.
Traditionally, Linux was used on uniprocessor desktop and server systems with
moderate memories, and the kernel adopted the classic model of Bonwick with
appropriate performance improvements. Over the years, due to the diversity of
the platforms with distinct priorities for which the Linux kernel is ported and
used, it turned out that the classic implementation of the slab core algorithm
is too inefficient to cater to all needs. While memory-constrained embedded
platforms cannot afford the higher footprint of the allocator (space used to
manage metadata and density of allocator operations), SMP systems with huge
memories need consistent performance, scalability, and better mechanisms to
generate trace and debug information on allocations.
To cater to these dissimilar requirements, current versions of the kernel
provide three distinct implementations of the slab algorithm: slob, a classic
K&R type list allocator designed for low-memory systems with scarce allocation
needs, which was the default object allocator for Linux during its initial
years (1991-1999); slab, a classic Solaris-style slab allocator that has been
around in Linux since 1999; and slub, improved for current-generation SMP
hardware with huge memories, which delivers consistent performance with better
control and debug mechanisms. The default kernel configuration for most
architectures enables slub as the default slab allocator; this can be changed
during kernel build through kernel configuration options.
CONFIG_SLAB: The regular slab allocator that is established and
known to work well in all environments. It organizes cache hot
objects in per-CPU and per node queues.
CONFIG_SLUB: SLUB is a slab allocator that minimizes cache line
usage instead of managing queues of cached objects (SLAB
approach). Per-CPU caching is realized using slabs of objects
instead of queues of objects. SLUB can use memory efficiently and
has enhanced diagnostics. SLUB is the default choice for a slab
allocator.
CONFIG_SLOB: SLOB replaces the stock allocator with a drastically
simpler allocator. SLOB is generally more space efficient but does
not perform as well on large systems.
Irrespective of the type of allocator chosen, the programming interface remains
unchanged. In fact, at a low level, all three allocators share some common code
base.
We shall now look into the physical layout of a cache and its control
structures.
Cache layout - generic
Each cache is represented by a cache descriptor structure, kmem_cache; this
structure contains all crucial metadata of the cache. It includes a list of
slab descriptors, each hosting a page or a group of page frames. Pages under
slabs contain objects or memory blocks, which are the allocation units of the
cache. The slab descriptor points to a list of objects contained in the pages
and tracks their state. A slab may be in one of three possible states (full,
partial, or empty) based on the state of the objects it is hosting. A slab is
considered full when all its objects are in use with no free objects left for
allocation. A slab with some objects in use and at least one free object is
considered to be in the partial state, and those with all objects in the free
state are considered empty.
This arrangement enables quick object allocations, since allocator routines can
look to a partial slab for a free object, and possibly move on to an empty slab
if required. It also eases expansion of the cache with new page frames to
accommodate more objects (when required), and facilitates safe and quick
reclaims (slabs in the empty state can be reclaimed).
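The three states and the order in which slabs are consulted can be expressed
compactly. The following is a minimal userspace sketch; slab_model, the ST_*
constants, slab_state_of(), and pick_slab() are hypothetical names, and a
simple array stands in for the slab lists.

```c
#include <assert.h>

enum slab_state { ST_EMPTY, ST_PARTIAL, ST_FULL };

/* Only the two counters needed to derive the state. */
struct slab_model {
    unsigned int inuse;     /* objects currently allocated */
    unsigned int objects;   /* total objects the slab can hold */
};

static enum slab_state slab_state_of(const struct slab_model *s)
{
    assert(s->inuse <= s->objects);
    if (s->inuse == 0)
        return ST_EMPTY;
    if (s->inuse == s->objects)
        return ST_FULL;
    return ST_PARTIAL;
}

/*
 * Choose a slab to allocate from: prefer a partial slab, then an empty
 * one. Returns the index, or -1 when the cache would have to grow.
 */
static int pick_slab(const struct slab_model *slabs, int n)
{
    int empty = -1;
    int i;

    for (i = 0; i < n; i++) {
        enum slab_state st = slab_state_of(&slabs[i]);

        if (st == ST_PARTIAL)
            return i;
        if (st == ST_EMPTY && empty < 0)
            empty = i;
    }
    return empty;
}
```

Preferring partial slabs keeps empty slabs untouched for as long as possible,
which is what makes them cheap candidates for reclaim.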
Slub data structures
Having looked at the layout of a cache and the descriptors involved at a
generic level, let's push further to view the specific data structures used by
the slub allocator and explore the management of free lists. Slub defines its
version of the cache descriptor, struct kmem_cache, in the kernel header
include/linux/slub_def.h:
struct kmem_cache {
        struct kmem_cache_cpu __percpu *cpu_slab;
        /* Used for retrieving partial slabs etc */
        unsigned long flags;
        unsigned long min_partial;
        int size;        /* The size of an object including metadata */
        int object_size; /* The size of an object without metadata */
        int offset;      /* Free pointer offset. */
        int cpu_partial; /* Number of per cpu partial objects to keep around */
        struct kmem_cache_order_objects oo;
        /* Allocation and freeing of slabs */
        struct kmem_cache_order_objects max;
        struct kmem_cache_order_objects min;
        gfp_t allocflags; /* gfp flags to use on each alloc */
        int refcount;     /* Refcount for slab cache destroy */
        void (*ctor)(void *);
        int inuse;        /* Offset to metadata */
        int align;        /* Alignment */
        int reserved;     /* Reserved bytes at the end of slabs */
        const char *name; /* Name (only for display!) */
        struct list_head list; /* List of slab caches */
        int red_left_pad; /* Left redzone padding size */
        ...
        ...
        ...
        struct kmem_cache_node *node[MAX_NUMNODES];
};
The list element refers to a list of slab caches. When a new slab is allocated,
it is stored on a list in the cache descriptor, and is considered empty, since
all its objects are free and available. Upon allocation of an object, the slab
turns into the partial state. Partial slabs are the only type of slabs that the
allocator needs to keep track of, and they are connected in a list inside the
kmem_cache structure. The SLUB allocator has no interest in tracking full slabs
whose objects have all been allocated, or empty slabs whose objects are all
free. SLUB tracks partial slabs for each node through the node[MAX_NUMNODES]
array of pointers to struct kmem_cache_node, each of which encapsulates a list
of partial slabs:
struct kmem_cache_node {
        spinlock_t list_lock;
        ...
        ...
#ifdef CONFIG_SLUB
        unsigned long nr_partial;
        struct list_head partial;
#ifdef CONFIG_SLUB_DEBUG
        atomic_long_t nr_slabs;
        atomic_long_t total_objects;
        struct list_head full;
#endif
#endif
};
All free objects in a slab form a linked list; when allocation requests arrive,
the first free object is removed from the list and its address is returned to
the caller. Tracking free objects through a linked list requires significant
metadata; while the traditional SLAB allocator maintained metadata for all
pages of a slab within the slab header (causing data alignment issues), SLUB
maintains per-page metadata for pages in a slab by cramming more fields into
the page descriptor structure, thereby eliminating metadata from the slab head.
SLUB metadata elements in the page descriptor are only valid when the
corresponding page is part of a slab. Pages engaged for slab allocations have
the PG_slab flag set.
The following are fields of the page descriptor relevant to SLUB:
struct page {
        ...
        ...
        union {
                pgoff_t index;  /* Our offset within mapping. */
                void *freelist; /* sl[aou]b first free object */
        };
        ...
        ...
        struct {
                union {
                        ...
                        struct { /* SLUB */
                                unsigned inuse:16;
                                unsigned objects:15;
                                unsigned frozen:1;
                        };
                        ...
                };
                ...
        };
        ...
        ...
        union {
                ...
                ...
                struct kmem_cache *slab_cache; /* SL[AU]B: Pointer to slab */
        };
        ...
        ...
};
The freelist pointer refers to the first free object in the list. Each free
object is composed of a metadata area that contains a pointer to the next free
object in the list; the position of this pointer within an object is given by
the offset field of the cache descriptor. index shares a union with freelist,
so it does not carry a page cache offset while the page is part of a slab. The
metadata area of the last free object contains the next free object pointer set
to NULL. inuse contains the total count of allocated objects, and objects
contains the total number of objects. frozen is a flag that is used as a page
lock: if a page has been frozen by a CPU core, only that core can retrieve free
objects from the page. slab_cache is a pointer to the kmem cache currently
using this page.
When an allocation request arrives, the first free object is located through
the freelist pointer, and is removed from the list by returning its address to
the caller. The inuse counter is also incremented to indicate an increase in
the number of allocated objects. The freelist pointer is then updated with the
address of the next free object in the list.
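The embedded free list can be reproduced in userspace to make the pointer
handling concrete. The following is a sketch, not SLUB itself:
slab_page_model and its helpers are hypothetical, the object size and count
are arbitrary, a byte array stands in for the slab's pages, and the release
path is included only to complete the picture.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define OBJ_SIZE 32   /* arbitrary object size for the sketch */
#define NR_OBJS  4    /* arbitrary number of objects in the slab */

/* Model of the per-slab fields described above, plus backing memory. */
struct slab_page_model {
    void *freelist;         /* first free object, NULL when none */
    unsigned int inuse;     /* allocated objects */
    unsigned int objects;   /* total objects */
    size_t offset;          /* where the next-free pointer lives in an object */
    unsigned char mem[OBJ_SIZE * NR_OBJS];
};

/* Read the next-free pointer stored inside a free object. */
static void *get_next(const struct slab_page_model *s, const void *obj)
{
    void *next;

    memcpy(&next, (const unsigned char *)obj + s->offset, sizeof(next));
    return next;
}

/* Store the next-free pointer inside a free object. */
static void set_next(const struct slab_page_model *s, void *obj, void *next)
{
    memcpy((unsigned char *)obj + s->offset, &next, sizeof(next));
}

/* Thread all objects onto the free list; the last one points to NULL. */
static void slab_init(struct slab_page_model *s, size_t offset)
{
    int i;

    s->offset = offset;
    s->objects = NR_OBJS;
    s->inuse = 0;
    s->freelist = NULL;
    for (i = NR_OBJS - 1; i >= 0; i--) {
        void *obj = s->mem + (size_t)i * OBJ_SIZE;

        set_next(s, obj, s->freelist);
        s->freelist = obj;
    }
}

/* Pop the first free object and advance freelist to its successor. */
static void *slab_alloc(struct slab_page_model *s)
{
    void *obj = s->freelist;

    if (!obj)
        return NULL;
    s->freelist = get_next(s, obj);
    s->inuse++;
    assert(s->inuse <= s->objects);
    return obj;
}

/* Push a released object back onto the head of the free list. */
static void slab_free(struct slab_page_model *s, void *obj)
{
    set_next(s, obj, s->freelist);
    s->freelist = obj;
    s->inuse--;
}
```

Since the link pointer lives inside the free object itself, the list costs no
memory beyond the objects; the space is simply reused once the object is
handed out.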
To achieve enhanced allocation efficiency, each CPU is assigned a private
active-slab list, which comprises a partial/free slab list for each object
type. These slabs are referred to as CPU local slabs, and are tracked by struct
kmem_cache_cpu:
struct kmem_cache_cpu {
        void **freelist;      /* Pointer to next available object */
        unsigned long tid;    /* Globally unique transaction id */
        struct page *page;    /* The slab from which we are allocating */
        struct page *partial; /* Partially allocated frozen slabs */
#ifdef CONFIG_SLUB_STATS
        unsigned stat[NR_SLUB_STAT_ITEMS];
#endif
};
When an allocation request arrives, the allocator takes the fast path and looks
into the freelist of the per-CPU cache, and it then returns a free object. This
is referred to as the fast path since allocations are carried out through
interrupt-safe atomic instructions that do not require lock contention. When
the fast path fails, the allocator takes the slow path and looks through the
page and partial lists of the CPU cache sequentially. If no free objects are
found, the allocator moves on to the partial lists of nodes; this operation
requires the allocator to contend for the appropriate exclusion lock. On
failure, the allocator gets a new slab from the buddy system. Fetching from
node lists or acquiring a new slab from the buddy system is considered a very
slow path, since neither of these operations is deterministic.
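The order in which these sources are consulted can be summarized with a small
userspace model. It is a deliberate simplification: cpu_cache_model,
node_model, the PATH_* constants, and alloc_object() are hypothetical names,
counters stand in for the actual lists, whole groups of objects are moved at
once, and no locking is modelled.

```c
#include <assert.h>

enum alloc_path {
    PATH_FAST,          /* served from the per-CPU freelist */
    PATH_SLOW_CPU,      /* refilled from the CPU's page or partial slabs */
    PATH_NODE_PARTIAL,  /* refilled from a node partial list (needs a lock) */
    PATH_NEW_SLAB       /* refilled with a new slab from the buddy system */
};

struct cpu_cache_model {
    int freelist_objs;     /* objects on the lockless per-CPU freelist */
    int page_objs;         /* free objects left in the current CPU slab */
    int cpu_partial_objs;  /* free objects in per-CPU partial slabs */
};

struct node_model {
    int partial_objs;      /* free objects in node partial slabs */
};

#define OBJS_PER_NEW_SLAB 8   /* arbitrary for this sketch */

/* Allocate one object and report which path served the request. */
static enum alloc_path alloc_object(struct cpu_cache_model *c,
                                    struct node_model *n)
{
    enum alloc_path path;

    if (c->freelist_objs > 0) {
        path = PATH_FAST;
    } else if (c->page_objs > 0) {
        c->freelist_objs = c->page_objs;
        c->page_objs = 0;
        path = PATH_SLOW_CPU;
    } else if (c->cpu_partial_objs > 0) {
        c->freelist_objs = c->cpu_partial_objs;
        c->cpu_partial_objs = 0;
        path = PATH_SLOW_CPU;
    } else if (n->partial_objs > 0) {
        c->freelist_objs = n->partial_objs;
        n->partial_objs = 0;
        path = PATH_NODE_PARTIAL;
    } else {
        c->freelist_objs = OBJS_PER_NEW_SLAB;
        path = PATH_NEW_SLAB;
    }

    c->freelist_objs--;
    assert(c->freelist_objs >= 0);
    return path;
}
```

Each refill in this model leaves the remaining objects on the per-CPU
freelist, so a costly path is typically followed by a run of fast-path
allocations.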
The following diagram depicts the relationship between slub data structures and
free lists:
Vmalloc
Page and slab allocators both allocate physically contiguous blocks of memory,
mapped to contiguous kernel address space. Most of the time, kernel services
and subsystems prefer to allocate physically contiguous blocks for exploiting
caching, address translation, and other performance-related benefits.
Nonetheless, allocation requests for very large blocks might fail due to
fragmentation of physical memory, and there are a few situations that
necessitate allocation of large blocks, such as support for dynamically
loadable modules, swap management operations, large file caches, and so on.
As a solution, the kernel provides vmalloc, a fragmented memory allocator that
attempts to allocate memory by joining physically scattered memory regions
through virtually contiguous address space. A range of virtual addresses within
the kernel segment is reserved for vmalloc mappings, called the vmalloc address
space. The total memory that can be mapped through the vmalloc interface
depends on the size of the vmalloc address space, which is defined by the
architecture-specific kernel macros VMALLOC_START and VMALLOC_END; for x86-64
systems, the total range of the vmalloc address space is a staggering 32 TB.
However, on the flip side, this range is too little for most 32-bit
architectures (a mere 120 MB). Recent kernel versions use the vmalloc range for
setting up a virtually mapped kernel stack (x86-64 only), which we discussed in
the first chapter.
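The idea of stitching scattered page frames into one virtually contiguous
range can be shown with a small address-translation model. This is only an
illustration in userspace: vm_area_model and virt_to_phys_model() are
hypothetical names, the addresses are made up, and a flat array stands in for
the page table entries that the kernel would set up.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SZ    4096u
#define AREA_PAGES 4

/* A virtually contiguous area backed by physically scattered frames. */
struct vm_area_model {
    uint64_t virt_start;              /* first virtual address of the area */
    uint64_t frame_phys[AREA_PAGES];  /* physical base of each backing page */
    size_t nr_pages;
};

/* Translate a virtual address in the area; returns 0 on success. */
static int virt_to_phys_model(const struct vm_area_model *a,
                              uint64_t vaddr, uint64_t *paddr)
{
    uint64_t off;
    size_t idx;

    assert(a->nr_pages <= AREA_PAGES);
    if (vaddr < a->virt_start)
        return -1;
    off = vaddr - a->virt_start;
    idx = (size_t)(off / PAGE_SZ);
    if (idx >= a->nr_pages)
        return -1;
    /* Per-page lookup: neighbouring virtual pages need not be adjacent. */
    *paddr = a->frame_phys[idx] + off % PAGE_SZ;
    return 0;
}
```

Two virtual addresses one byte apart can resolve to frames far from each other
when they straddle a page boundary, which is why such memory cannot be handed
to a device that expects one physically contiguous buffer.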
Following are interface routines for vmalloc allocations and deallocations:
/**
 * vmalloc - allocate virtually contiguous memory
 * @size: allocation size
 * Allocate enough pages to cover @size from the page level
 * allocator and map them into contiguous kernel virtual space.
 *
 */
void *vmalloc(unsigned long size)
/**
 * vzalloc - allocate virtually contiguous memory with zero fill
 * @size: allocation size
 * Allocate enough pages to cover @size from the page level
 * allocator and map them into contiguous kernel virtual space.
 * The memory allocated is set to zero.
 *
 */
void *vzalloc(unsigned long size)
/**
 * vmalloc_user - allocate zeroed virtually contiguous memory for userspace
 * @size: allocation size
 * The resulting memory area is zeroed so it can be mapped to userspace
 * without leaking data.
 */
void *vmalloc_user(unsigned long size)
/**
 * vmalloc_node - allocate memory on a specific node
 * @size: allocation size
 * @node: numa node
 * Allocate enough pages to cover @size from the page level
 * allocator and map them into contiguous kernel virtual space.
 *
 */
void *vmalloc_node(unsigned long size, int node)
/**
 * vfree - release memory allocated by vmalloc()
 * @addr: memory base address
 * Free the virtually continuous memory area starting at @addr, as
 * obtained from vmalloc(), vmalloc_32() or __vmalloc(). If @addr is
 * NULL, no operation is performed.
 */
void vfree(const void *addr)
/**
 * vfree_atomic - release memory allocated by vmalloc()
 * @addr: memory base address
 * This one is just like vfree() but can be called in any atomic context except NMIs.
 */
void vfree_atomic(const void *addr)
Most kernel developers avoid vmalloc allocations due to allocation overheads
(since those are not identity mapped and require specific page table tweaks,
resulting in TLB flushes) and performance penalties involved during access
operations.
Contiguous Memory Allocator (CMA)
Albeit with significant overheads, virtually mapped allocations solve the
problem of large memory allocations to a great extent. However, there are a
few scenarios that mandate the allocation of physically contiguous buffers. DMA
transfers are one such case. Device drivers often have a stringent need for
physically contiguous buffer allocations (for setting up DMA transfers), which
are carried out through any of the physically contiguous allocators discussed
earlier.
However, drivers dealing with specific classes of devices, such as multimedia,
often find themselves searching for huge blocks of contiguous memory. To meet
this end, over the years, such drivers have been reserving memory during system
boot through the kernel parameter mem, which allows setting aside enough
contiguous memory at boot, which can be remapped into linear address space
during driver runtime. Though valuable, this strategy has its limitations: first,
such reserved memory lies unused whenever the corresponding device
is not initiating access operations, and second, depending on the number of
devices to be supported, the size of reserved memory might increase
substantially, which might severely impact system performance due to cramped
physical memory.
The Contiguous Memory Allocator (CMA) is a kernel mechanism introduced to
effectively manage reserved memory. The crux of CMA is to bring reserved
memory under the allocator algorithm; such memory is referred to as the CMA
area. CMA allows allocations from the CMA area for both devices' and the system's
use. This is achieved by building a page descriptor list for pages in reserved
memory, and enumerating it into the buddy system, which enables allocation of
CMA pages through the page allocator for regular needs (kernel subsystems) and
through DMA allocation routines for device drivers.
However, it must be ensured that DMA allocations do not fail due to the usage of
CMA pages for other purposes, and this is taken care of through the migratetype
attribute, which we discussed earlier. Pages enumerated by CMA into the buddy
system are assigned the MIGRATE_CMA property, which indicates that the pages are
movable. While allocating memory for non-DMA purposes, the page allocator
can use CMA pages only for movable allocations (recall that such allocations
can be made through the __GFP_MOVABLE flag). When a DMA allocation request
arrives, CMA pages held by kernel allocations are moved out of the reserved
region (through a page-migration mechanism), resulting in the availability of
memory for the device driver's use. Further, when pages are allocated for DMA,
their migratetype is changed from MIGRATE_CMA to MIGRATE_ISOLATE, making them
invisible to the buddy system.
The size of the CMA area can be chosen during kernel build through its
configuration interface; optionally, it can also be passed through the kernel
parameter cma=.
Summary
We have traversed through one of the most crucial aspects of the Linux kernel,
comprehending various nuances of memory representations and allocations. By
understanding this subsystem, we have also succinctly captured the design
acumen and implementation efficiency of the kernel, and more importantly
understood the kernel's dynamism in accommodating finer and newer heuristics
and mechanisms for continuous enhancements. Apart from the specifics of
memory management, we also gauged the efficiency of the kernel in maximizing
resource usage at minimal costs, ushering all classical mechanisms of code reuse
and modular code structures.
Though the specifics of memory management may vary in correspondence to the
underlying architecture, the generalities of design and implementation styles
would mostly remain the same to achieve code stability and sensitivity to
change.
In the next chapter, we will go further and look at another fundamental
abstraction of the kernel: files. We will look through file I/O and explore its
architecture and implementation details.
Filesystems and File I/O
Thus far we have traversed across the elemental resources of the kernel, such as
address spaces, processor time, and physical memory. We have built an
empirical understanding of process management, CPU scheduling, and memory
management and the crucial abstractions they provide. We shall continue to
build our understanding in this chapter by looking at another key abstraction
provided by the kernel, the file I/O architecture. We will look in detail at aspects
such as:
Filesystem implementation
File I/O
VFS
VFS data structures
Special filesystems
Computing systems exist for the sole purpose of processing data. Most
algorithms are designed and programmed to extract desired information from
acquired data. Data that fuels this process must be stored persistently for
continuous access, mandating storage systems to be engineered to contain
information safely for longer periods of time. For users, however, it is the
operating system that fetches data from these storage devices and makes it
available for processing. The kernel's filesystem is the component that serves
this purpose.
Filesystem - high-level view
Filesystems abstract the physical view of storage devices from users, and
virtualize the storage area on a disk for each valid user of the system through
abstract containers called files and directories. Files serve as containers for user
data, and directories act as containers for a group of user files. In simple words,
operating systems virtualize a view of a storage device for each user as a set of
directories and files. Filesystem services implement routines to create, organize,
store, and retrieve files, and these operations are invoked by user applications
through appropriate system call interfaces.
We will begin this discussion by looking at the layout of a simple filesystem,
designed to manage a standard magnetic storage disk. This discussion will help
us comprehend key terms and concepts related to disk management in general. A
typical filesystem implementation involves appropriate data structures
that describe the organization of file data on disk, and operations that enable
applications to execute file I/O.
Metadata
A storage disk is typically composed of physical blocks of identical size called
sectors; the size of a sector is usually 512 bytes or a multiple of it, depending on
the type and capacity of storage. A sector is the minimal unit of I/O on the disk.
When a disk is presented to the filesystem for management, it perceives the storage
area as an array of blocks of fixed size, where each block is identical to a sector or
a multiple of the sector size. The typical default block size is 1024 bytes and can
vary as per disk capacity and filesystem type. The block size is considered the
minimal unit of I/O by a filesystem.
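The relationship between byte offsets, filesystem blocks, and disk sectors can be sketched with a little arithmetic. The following is only an illustration under assumed sizes (512-byte sectors, 1024-byte blocks); the function names are hypothetical and do not come from any kernel interface.

```c
#include <assert.h>
#include <stdint.h>

#define SECTOR_SIZE 512u   /* assumed sector size in bytes */
#define BLOCK_SIZE  1024u  /* assumed filesystem block size in bytes */

/* Index of the filesystem block that holds byte `offset` of the volume. */
uint64_t block_of_offset(uint64_t offset)
{
    return offset / BLOCK_SIZE;
}

/* First disk sector backing filesystem block `block`. */
uint64_t first_sector_of_block(uint64_t block)
{
    /* A block must be a whole number of sectors for this mapping to hold. */
    assert(BLOCK_SIZE % SECTOR_SIZE == 0);
    return block * (BLOCK_SIZE / SECTOR_SIZE);
}

/* Number of blocks needed to store `len` bytes, rounded up. */
uint64_t blocks_for_length(uint64_t len)
{
    return (len + BLOCK_SIZE - 1) / BLOCK_SIZE;
}
```

With these assumed sizes, a file of 1025 bytes occupies two blocks even though the second block holds a single byte, which is why the block size bounds both I/O granularity and wasted space.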
Inode (index node)
The filesystem needs to maintain metadata to identify and track various
attributes for each file and directory created by a user. There are several elements
of metadata that describe a file, such as the filename, type of file, last access
timestamp, owner, access privileges, last modification timestamp, creation time,
size of file data, and references to disk blocks containing file data.
Conventionally, filesystems define a structure called an inode to contain all
metadata of a file. The size and type of information contained in an inode is
filesystem specific and may vary widely based on the functionalities it supports.
Each inode is identified by a unique number referred to as an index, which is
considered a low-level name of the file.
Filesystems reserve a few disk blocks for storing inode instances and the rest for
storing corresponding file data. The number of blocks reserved for storing inodes
depends on the storage capacity of the disk. The on-disk list of inodes held in
inode blocks is referred to as the inode table. Filesystems need to track
the status of the inode and data blocks to identify free blocks. This is generally
achieved through bitmaps: one bitmap for tracking free inodes and another to track
free data blocks. The following diagram shows the typical layout with bitmap,
inode, and data blocks:
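As a rough sketch of the bitmap bookkeeping described above, the following code treats each bit as the state of one inode or data block (set means in use). The names bitmap_alloc and bitmap_free are hypothetical, and a first-fit linear scan is assumed purely for simplicity; real filesystems use more elaborate placement policies.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Find the first clear bit among `nbits`, mark it used, and return its
 * index. Returns -1 when every tracked block (or inode) is in use. */
long bitmap_alloc(unsigned char *map, size_t nbits)
{
    for (size_t i = 0; i < nbits; i++) {
        unsigned char mask = (unsigned char)(1u << (i % CHAR_BIT));
        if (!(map[i / CHAR_BIT] & mask)) {
            map[i / CHAR_BIT] |= mask;
            return (long)i;
        }
    }
    return -1;
}

/* Mark bit `bit` as free again. */
void bitmap_free(unsigned char *map, size_t nbits, size_t bit)
{
    assert(bit < nbits);
    map[bit / CHAR_BIT] &= (unsigned char)~(1u << (bit % CHAR_BIT));
}
```

The same pair of routines can serve both the inode bitmap and the data block bitmap, since each only records whether a numbered slot is free.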
Data block map
As mentioned before, each inode should record the locations of data blocks in
which the corresponding file data is stored. Depending on the length of file data,
each file might occupy n number of data blocks. There are various methods used
to track data block details in an inode; the simplest is direct references,
in which the inode contains direct pointers to the data blocks of the file.
The number of such direct pointers depends on filesystem design, and
most implementations choose to engage fewer bytes for such pointers. This
method is productive for small files that span across a few data blocks (usually
< 16 KB), but lacks support for large files spread across numerous data blocks.
To support large files, filesystems engage an alternate method called multi-level
indexing, which involves indirect pointers. The simplest implementation would
have an indirect pointer along with a few direct pointers in an inode structure.
An indirect pointer refers to a block containing direct pointers to data blocks
of the file. When a file grows too large to be referred to through the direct
pointers of the inode, a free data block is filled with direct pointers and the
indirect pointer of the inode is set to refer to it. The data block referred to by an
indirect pointer is called an indirect block. The number of direct pointers in an
indirect block can be determined by the block size divided by the size of block
addresses; for instance, on a 32-bit filesystem with 4-byte (32 bits) wide block
addresses and a 1024-byte block size, each indirect block can contain up to 256
entries, whereas in a 64-bit filesystem with 8-byte (64 bits) wide block addresses,
each indirect block can contain up to 128 direct pointers.
This technique can be furthered to support even larger files by engaging a
double-indirect pointer, which refers to a block containing indirect pointers,
with each entry referring to a block containing direct pointers. Assuming a 64-bit
filesystem with a 1024-byte block size, with each block accommodating 128 entries,
there would be 128 indirect pointers each pointing to a block holding 128 direct
pointers; thus, with this technique a filesystem can support a file that can span up
to 16,384 (128 x 128) data blocks, which is 16 MB.
Further, this technique can be extended with a triple-indirection pointer,
resulting in even more metadata to be managed by filesystems. However, despite
multi-level indexing, increasing the filesystem block size with a reduction in block
address size is the most recommended and efficient solution to support larger
files. Users will need to choose the appropriate block size while initializing a
disk with a filesystem, to ensure proper support for larger files.
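The capacity figures quoted for indirect and double-indirect pointers follow from simple arithmetic, which the sketch below reproduces. The helper names are hypothetical, and the model assumes one pointer per indirection level in the inode, which is a simplification of how real on-disk formats are laid out.

```c
#include <assert.h>
#include <stdint.h>

/* Number of block addresses that fit in one indirect block. */
uint64_t ptrs_per_block(uint64_t block_size, uint64_t addr_size)
{
    assert(addr_size != 0 && block_size % addr_size == 0);
    return block_size / addr_size;
}

/* Data blocks reachable from an inode that holds `n_direct` direct
 * pointers plus one pointer for each indirection level from 1 (single
 * indirect) up to `levels` (2 = double indirect, 3 = triple indirect). */
uint64_t max_data_blocks(uint64_t n_direct, uint64_t block_size,
                         uint64_t addr_size, unsigned levels)
{
    uint64_t per = ptrs_per_block(block_size, addr_size);
    uint64_t total = n_direct;
    uint64_t reach = 1;

    for (unsigned l = 0; l < levels; l++) {
        reach *= per;      /* blocks reachable through this level alone */
        total += reach;
    }
    return total;
}
```

For example, with 1024-byte blocks and 8-byte addresses, the double-indirect level alone reaches 128 x 128 = 16,384 blocks; multiplying by the block size gives the 16 MB mentioned in the text.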
Some filesystems use a different approach called extents to store data block
information in an inode. An extent is a pointer that refers to the start data block
(similar to a direct pointer) with added length bits that specify the count
of contiguous blocks where file data is stored. Depending on file size and disk
fragmentation levels, a single extent might not be sufficient to refer to all data
blocks of the file, and to handle such eventualities, filesystems build extent lists,
with each extent referring to the start address and length of one region of
contiguous data blocks on disk.
The extents approach significantly reduces the metadata that filesystems need to
manage to store data block maps, but this is realized at the cost of
flexibility in filesystem operations. For instance, consider a read operation to be
performed at a specific file position of a large file: to locate the data block for
the specified file offset, the filesystem must begin with the first extent and
scan through the list until it finds the extent that covers the required file offset.
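The linear scan over an extent list described above can be sketched as follows. The struct layout and the name extent_lookup are assumptions made for illustration; on-disk extent formats in real filesystems differ and are often organized as trees to avoid exactly this scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical extent record: `len` contiguous disk blocks starting at
 * `disk_block` hold the file's logical blocks starting at `file_block`. */
struct extent {
    uint64_t file_block;
    uint64_t disk_block;
    uint64_t len;
};

/* Walk the extent list from the start until one covers `file_block`.
 * Returns the disk block number, or UINT64_MAX if no extent maps it. */
uint64_t extent_lookup(const struct extent *ext, size_t n, uint64_t file_block)
{
    assert(ext != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (file_block >= ext[i].file_block &&
            file_block - ext[i].file_block < ext[i].len)
            return ext[i].disk_block + (file_block - ext[i].file_block);
    }
    return UINT64_MAX;
}
```

Two extents are enough here to describe six data blocks, whereas direct pointers would need six entries; the price is that a lookup near the end of a heavily fragmented file walks the whole list.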
Directories
Filesystems consider a directory as a special file. They represent a directory or a
folder with an on-disk inode. Directory inodes are differentiated from normal file
inodes through the type field, which is marked as directory. Each directory is assigned
data blocks where it holds information about the files and subdirectories it contains.
A directory maintains records of files, and each record includes the filename,
which is a name string not exceeding a specific length as defined by the
filesystem's naming policy, and the inode number associated with the file. For
efficient management, filesystem implementations define the layout of file
records contained in a directory through appropriate data structures such as
binary trees, lists, radix trees, and hash tables.
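A directory's records can be pictured as name-to-inode pairs. The sketch below uses the simplest of the layouts mentioned, a flat list, and a hypothetical 27-character name limit; the struct and function names are assumptions for illustration, and inode number 0 is assumed to mark an unused slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NAME_MAX 27  /* assumed naming-policy limit, not a real constant */

/* Hypothetical directory record: a filename and its inode number. */
struct toy_dirent {
    uint32_t inode;                  /* 0 marks an unused record */
    char     name[TOY_NAME_MAX + 1];
};

/* Scan the records of one directory for `name`.
 * Returns the inode number, or 0 when no live record matches. */
uint32_t toy_dir_lookup(const struct toy_dirent *recs, size_t n,
                        const char *name)
{
    assert(name != NULL);
    for (size_t i = 0; i < n; i++) {
        if (recs[i].inode != 0 && strcmp(recs[i].name, name) == 0)
            return recs[i].inode;
    }
    return 0;
}
```

A hash table or tree replaces the loop in filesystems that expect large directories, but the record contents stay the same.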
Superblock
Apart from storing inodes that capture metadata of individual files, filesystems
also need to maintain metadata pertaining to the disk volume as a whole, such as
the size of the volume, total block count, current state of the filesystem, count of
inode blocks, count of inodes, count of data blocks, start inode block number, and
filesystem signature (magic number) for identity. These details are captured in a
data structure called the superblock. During initialization of the filesystem on a
disk volume, the superblock is organized at the start of disk storage. The following
diagram illustrates the complete layout of disk storage with superblocks:
Operations
While data structures make up the elementary constituents of a filesystem design,
the operations possible on those data structures to render file access and
manipulation make up its core feature set. The number of operations
and the type of functionalities supported are filesystem implementation specific.
Following is a generic description of a few common operations that most
filesystems provide.
Mount and unmount operations
Mount is an operation of enumerating an on-disk superblock and metadata into
memory for the filesystem's use. This process creates in-memory data structures
that describe file metadata and present the host operating system with a view of
the directory and file layout in the volume. The mount operation is implemented
to check the consistency of the disk volume. As discussed earlier, the superblock
contains the state of the filesystem; it indicates whether the volume is consistent
or dirty. If the volume is clean or consistent, a mount operation would succeed,
and if the volume is marked as dirty or inconsistent, it returns with the
appropriate failure status.
An abrupt shutdown causes the filesystem state to be dirty, and requires a consistency
check before it can be marked for use again. Mechanisms adopted for
consistency checks are complex and time consuming; such operations are
filesystem implementation specific. Simpler implementations provide dedicated tools
for consistency checks, while modern implementations use journaling.
Unmount is an operation of flushing the in-memory state of filesystem data
structures back to disk. This operation causes all metadata and file caches to be
synchronized with disk blocks. Unmount marks the filesystem state in the
superblock as consistent, indicating graceful shutdown. In other words, the
on-disk superblock state remains dirty until unmount is executed.
File creation and deletion operations
Creation of a file is an operation that requires instantiation of a new inode with
appropriate attributes. User programs invoke the file creation routine with
chosen attributes such as the filename, the directory under which the file is to be
created, access permissions for various users, and file modes. This routine also
initializes other specific fields of the inode, such as the creation timestamp and
file ownership information. This operation writes a new file record into the
directory block, describing the filename and inode number.
When a user application initiates a delete operation on a valid file, the filesystem
removes the corresponding file record from the directory and checks the file's
reference count to determine the number of processes currently using the file.
Deletion of a file record from a directory prevents other processes from opening
the file that is marked for deletion. When all current references to a file are
closed, all resources assigned to the file are released by returning its data blocks
to the list of free data blocks, and its inode to the list of free inodes.
File open and close operations
When a user process attempts to open a file, it invokes the open operation of the
filesystem with appropriate arguments, which include the path and name of the file.
The filesystem traverses through the directories specified in the path until it reaches
the immediate parent directory that contains the requested file's record. A lookup
into the file record produces the inode number of the specified file. However, the
specific logic and efficiency of the lookup operation depend on the data structure
chosen by the particular filesystem implementation for organizing file records in
a directory block.
Once the filesystem retrieves the related inode number of the file, it initiates
appropriate sanity checks to enforce access control validation on the calling
context. If the caller process is cleared for file access, the filesystem then
instantiates an in-memory structure called a file descriptor to maintain file access
state and attributes. Upon successful completion, the open operation returns the
reference of the file descriptor structure to the caller process, which serves as a
handle to the file for the caller process to initiate other file operations such as
read, write, and close.
Upon initiating a close operation, the file descriptor structure is destroyed and the
file's reference count is decremented. The caller process will no longer be able to
initiate any other operation on the file until it opens the file again.
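The reference counting that ties open, close, and delete together can be modelled in a few lines. This is a toy sketch with hypothetical names, assuming a single counter of open handles per file; it is not how any particular filesystem tracks links and references.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical in-memory state for one file; not a kernel structure. */
struct toy_inode {
    unsigned open_count;  /* processes currently holding the file open */
    bool     unlinked;    /* directory record has been removed */
    bool     released;    /* data blocks and inode returned to free lists */
};

/* Open succeeds only while a directory record still names the file. */
int toy_open(struct toy_inode *ino)
{
    if (ino->unlinked)
        return -1;
    ino->open_count++;
    return 0;
}

/* Delete: drop the directory record; release resources immediately
 * only if no process currently has the file open. */
void toy_unlink(struct toy_inode *ino)
{
    ino->unlinked = true;
    if (ino->open_count == 0)
        ino->released = true;
}

/* Close: drop one reference; the last close of a deleted file
 * releases its resources. */
void toy_close(struct toy_inode *ino)
{
    assert(ino->open_count > 0);
    ino->open_count--;
    if (ino->unlinked && ino->open_count == 0)
        ino->released = true;
}
```

A process that opened the file before the delete keeps a working handle until it closes it, while any new open attempt after the delete fails because the directory record is gone.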
File read and write operations
When user applications initiate a read on a file with appropriate arguments, the
underlying filesystem's read routine is invoked. The operation begins with a lookup
into the file's data block map to locate the appropriate disk sector to be read;
it then allocates a page from the page cache and schedules disk I/O. On
completion of the I/O transfer, the filesystem moves the requested data into the
application's buffer and updates the file offset position in the caller's file
descriptor structure.
Similarly, the write operation of the filesystem retrieves data passed from the user
buffer and writes it into the appropriate offset of the file buffer in the page cache,
and marks the page with the PG_dirty flag. However, when the write operation is
invoked to append data at the end of the file, new data blocks might be required
for the file to grow. The filesystem looks for free data blocks on disk, and
allocates them for this file, before proceeding with the write. Allocating new data
blocks requires changes to the inode structure's data block map and allocation
of new page(s) from the page cache mapped to the newly allocated data blocks.
Additional features
Though the fundamental components of a filesystem remain similar, the way
data is organized and the heuristics to access data are implementation dependent.
Designers consider factors such as reliability, security, type and capacity of
storage volume, and I/O efficiency to identify and support features that enhance
the capabilities of a filesystem. Following are a few extended features that are
supported by modern filesystems.
Extended file attributes
General file attributes tracked by a filesystem implementation are maintained in
an inode and interpreted by appropriate operations. Extended file attributes are a
feature that enables users to define custom metadata for a file, which is not
interpreted by the filesystem. Such attributes are often used to store various
types of information that depend on the type of data the file contains. For
instance, document files can define the author name and contact details, and web
files can specify the URL of the file and other security-related attributes such as
digital certificates and crypto hash keys. Similar to normal attributes, each
extended attribute is identified by a name and a value. Most filesystems
do not impose restrictions on the number of such extended attributes.
Some filesystems also provide a facility for indexing the attributes, which aids in
quick lookup for the required type of data without having to navigate the file
hierarchy. For instance, assume that files are assigned an extended attribute called
Keywords, which records keyword values that describe file data. With indexing,
the user could issue queries to find the list of files matching specific keywords
through appropriate scripts, regardless of the file's location. Thus, indexing
offers a powerful alternative interface to the filesystem.
Filesystem consistency and crash recovery
Consistency of an on-disk image is critical for reliable functioning of a
filesystem. While the filesystem is in the process of updating its on-disk
structures, there is every possibility for a catastrophic error to occur (power
down, OS crash, and so on), causing interruption of a partially committed critical
update. This results in corruption of on-disk structures and leaves the filesystem
in an inconsistent state. Dealing with such eventualities, by engaging an effective
strategy for crash recovery, is one of the major challenges faced by most
filesystem designers.
Some filesystems handle crash recovery through a specially designed filesystem
consistency check tool like fsck (a widely used Unix tool). It is run at system
boot before mount and scans through on-disk filesystem structures looking for
inconsistencies, and fixes them when found. Once finished, the on-disk
filesystem state is restored to a consistent state and the system proceeds with the
mount operation, thus making the disk accessible to users. The tool executes its
operations in a number of phases, closely checking the consistency of each
on-disk structure in turn: the superblock, inode blocks, free blocks, the validity
of individual inodes, directories, and bad blocks.
Though it provides much-needed crash recovery, it has its downsides: such
phased operations can consume a lot of time to complete on a large disk volume,
which directly impacts the system's boot time.
Journaling is another technique engaged by most modern filesystem
implementations for quick and reliable crash recovery. This method is enforced
by programming appropriate filesystem operations for crash recovery. The idea
is to prepare a log (note) listing out the changes to be committed to the on-disk
image of the filesystem, and to write the log to a special disk block called a
journal block, before beginning the actual update operation. This ensures that
on a crash during the actual update, the filesystem can easily detect inconsistencies
and fix them by looking through the information recorded in the log. Thus, an
implementation of a journaling filesystem eliminates the need for the tedious and
expensive task of a disk scan, by marginally extending the work done during an
update.
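The log-then-update idea can be sketched with an in-memory stand-in for a disk. The structures and function names below are hypothetical; the model assumes the whole log is written atomically and ignores details such as checksums, transaction IDs, and write ordering that real journals must handle.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_BLOCKS  8   /* size of the pretend disk */
#define TOY_LOG_MAX 4   /* capacity of the pretend journal */

/* One intended change: store `value` into disk block `block`. */
struct toy_update { int block; int value; };

struct toy_disk {
    int               blocks[TOY_BLOCKS];
    struct toy_update log[TOY_LOG_MAX];  /* the journal area */
    int               log_len;
    bool              committed;         /* log is complete and valid */
};

/* Step 1: record the intended changes in the journal before touching
 * the real blocks. */
void journal_write(struct toy_disk *d, const struct toy_update *u, int n)
{
    assert(n >= 0 && n <= TOY_LOG_MAX);
    for (int i = 0; i < n; i++) {
        assert(u[i].block >= 0 && u[i].block < TOY_BLOCKS);
        d->log[i] = u[i];
    }
    d->log_len = n;
    d->committed = true;
}

/* Step 2: apply logged changes in place. `limit` lets a caller stop
 * early to imitate a crash in the middle of the update. */
void apply_updates(struct toy_disk *d, int limit)
{
    for (int i = 0; i < d->log_len && i < limit; i++)
        d->blocks[d->log[i].block] = d->log[i].value;
}

/* Step 3: once every change is in place, the log can be discarded. */
void journal_clear(struct toy_disk *d)
{
    d->log_len = 0;
    d->committed = false;
}

/* Recovery: if a committed log survives, replay it fully and clear it.
 * Returns true when a replay was needed. */
bool journal_recover(struct toy_disk *d)
{
    if (!d->committed)
        return false;
    apply_updates(d, d->log_len);
    journal_clear(d);
    return true;
}
```

Replaying the log is safe even when part of the update had already reached the disk, because each entry simply rewrites a block with its intended final value.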
Access control lists (ACLs)
The default file and directory access permissions, which specify access rights for
the owner, the group to which the owner belongs, and other users, do not offer the
fine-grained control required in some situations. ACLs are a feature that enables
an extended mechanism to specify file access permissions for various processes
and users. This feature considers all files and directories as objects, and allows
system administrators to define a list of access permissions for each. ACLs
include operations valid on an object with access privileges, and restrictions for
each user and system process on a specified object.
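In its simplest form, an ACL is a per-object list of (user, permitted operations) entries that is consulted on each access. The sketch below assumes a made-up entry layout and permission bits; it does not follow the POSIX ACL format or any kernel interface, and it denies access to users with no entry.

```c
#include <assert.h>
#include <stddef.h>

/* Made-up permission bits for this sketch. */
enum { TOY_READ = 4, TOY_WRITE = 2, TOY_EXEC = 1 };

/* Hypothetical ACL entry: the operations one user may perform. */
struct toy_acl_entry {
    unsigned uid;
    unsigned allowed;   /* bitwise OR of TOY_* bits */
};

/* Returns 1 if the object's ACL grants `uid` every bit in `wanted`,
 * and 0 otherwise (including when the user has no entry at all). */
int toy_acl_permits(const struct toy_acl_entry *acl, size_t n,
                    unsigned uid, unsigned wanted)
{
    assert(acl != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (acl[i].uid == uid)
            return (acl[i].allowed & wanted) == wanted;
    }
    return 0;
}
```

Unlike the owner/group/others model, this list can grant two unrelated users different rights on the same file without placing them in a common group.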
Filesystems in the Linux kernel
Now that we are familiar with fundamental concepts related to filesystem
implementations, we will explore filesystem services supported by Linux
systems. The kernel's filesystem branch has implementations of numerous
filesystem services, which support diverse file types. Based on the type of files
they manage, the kernel's filesystems can be broadly categorized into:
1. Storage filesystems
2. Special filesystems
3. Distributed filesystems or network filesystems
We shall discuss special filesystems in a later section of this chapter.
Storage filesystems: The kernel supports various persistent storage filesystems,
which can be broadly categorized into various groups based on the type of
storage device they are designed to manage.
Disk filesystems: This category includes various standard storage disk
filesystems supported by the kernel, which include the Linux native ext
family of disk filesystems (Ext2, Ext3, and Ext4) along with ReiserFS and Btrfs;
Unix variants such as the sysv filesystem, UFS, and MINIX filesystem;
Microsoft filesystems such as MS-DOS, VFAT, and NTFS; other
proprietary filesystems such as IBM's OS/2 (HPFS), QNX-based filesystems
such as qnx4 and qnx6, Apple's Macintosh HFS and HFS+, Amiga's Fast
Filesystem (AFFS), and Acorn Disk Filing System (ADFS); and journaling
filesystems like IBM's JFS and SGI's XFS.
Removable media filesystems: This category includes filesystems
designed for CD, DVD, and other removable storage media devices, such as
the ISO9660 CD-ROM filesystem and Universal Disk Format (UDF) DVD
filesystem, and squashfs used in live CD images for Linux distributions.
Semiconductor storage filesystems: This category includes filesystems
designed and implemented for raw flash and other semiconductor storage
devices that require support for wear-leveling and erase operations. The
current set of filesystems supported includes UBIFS, JFFS2, CRAMFS, and
so on.
We shall briefly discuss a few native disk filesystems in the kernel, which are
used across various distributions of Linux as the default.
Ext family filesystems
The initial release of the Linux kernel used MINIX as the default native
filesystem, which was designed for use in the Minix kernel for educational
purposes and hence had many usage limitations. As the kernel matured, kernel
developers built a new native filesystem for disk management called the
extended filesystem. The design of ext was heavily influenced by the standard
Unix filesystem UFS. Due to various implementation limitations and lack of
efficiency, the original ext was short lived and was soon replaced by an
improved, stable, and efficient version named the second extended filesystem
(Ext2). The Ext2 filesystem continued to be the default native filesystem for
quite a long period of time (until 2001, with the 2.4.15 release of the Linux
kernel).
Later, rapid evolution in disk storage technologies led to a massive increase in
storage capacity and efficiency of storage hardware. To exploit features provided
by storage hardware, the kernel community evolved forks of ext2 with
appropriate design improvements and added features that are best suited to a
specific class of storage. Current versions of the Linux kernel contain three
versions of extended filesystems, called Ext2, Ext3, and Ext4.
Ext2
The Ext2 filesystem was first introduced in kernel version 0.99.7 (1993). It
retains the core design of classic UFS (Unix filesystem) with write-back caching,
which enables short turnaround time and improved performance. Although it
was implemented to support disk volumes in the range of 2 TB to 32 TB and file
sizes in the range of 16 GB to 2 TB, its usage was restricted to 4 TB disk
volumes and 2 GB maximum file sizes due to block device and application imposed
restrictions in 2.4 kernels. It also includes support for ACLs, file memory maps,
and crash recovery through the consistency checker tool fsck. Ext2 divides
physical disk sectors into fixed-size block groups. A filesystem layout is
constructed for each block group, with each having a complete superblock, free
block bitmap, inode bitmap, inode, and data blocks. Thus, each block group
appears as a miniature filesystem. This design assists fsck with faster consistency
checks on a large disk.
Ext3
Also called the third extended filesystem, it extends the functionality of Ext2 with
journaling. It retains the entire structure of Ext2 with block groups, which
enables seamless conversion of an Ext2 partition into an Ext3 type. As discussed
earlier, journaling causes the filesystem to log details of an update operation into
specific regions of disk called journal blocks; these logs help expedite crash
recovery and ensure consistency and reliability of the filesystem. However, on
journaling filesystems, disk update operations can turn expensive due to slower
or variable-time write operations (due to the journal log), which would directly
impact the performance of regular file I/O. As a solution, Ext3 provides journal
configuration options through which system administrators or users can select
specific types of information to be logged to a journal. These configuration
options are referred to as journaling modes.
1. Journal mode: This mode causes the filesystem to record both file data and
metadata changes into the journal. This results in maximized filesystem
consistency with increased disk access, causing slower updates. This mode
causes the journal to consume additional disk blocks and is the slowest Ext3
journaling mode.
2. Ordered mode: This mode records only filesystem metadata into the
journal, but it guarantees that related file data is written to disk before
associated metadata is committed to the journal block. This ensures that file
data is valid; if a crash occurs while executing a write to a file, the journal
will indicate that the appended data has not been committed, resulting in a
purge operation on such data by the cleanup process. This is the default
journaling mode of Ext3.
3. Writeback mode: This is similar to ordered mode with only metadata
journaling, but with the exception that the related file contents might be
written to disk before or after the metadata is committed to the journal. This
can result in corruption of file data. For example, a file being
appended to may be marked in the journal as committed before the actual file
write: if a crash occurs during the file append operation, then the journal
suggests the file is larger than it actually is. This mode is the fastest but
minimizes file data reliability. Many other journaling filesystems such as
JFS use this mode of journaling, but ensure that any garbage due to
unwritten data is zeroed out on reboot.
All of these modes have a similar effect with respect to the consistency of
metadata, but differ in consistency of file and directory data, with journal mode
ensuring maximum safety with minimal chance of file data corruption, and
writeback mode offering minimal safety with high risk of corruption.
Administrators or users can select the appropriate mode during the mount operation
on an Ext3 volume.
Ext4
Implemented as a replacement for Ext3 with enhanced features, Ext4 first
appeared in kernel 2.6.28 (2008). It is fully backward compatible with Ext2 and
Ext3, and a volume of either type can be mounted as Ext4. This is the default ext
filesystem on most current Linux distributions. It extends the journaling capabilities
of Ext3 with journal checksums, which increase its reliability. It also adds
checksums for filesystem metadata and supports transparent encryption,
resulting in enhanced filesystem integrity and security. Other features include
support for extents, which help reduce fragmentation; persistent preallocation of
disk blocks, which enables allocation of contiguous blocks for media files; and
support for disk volumes with storage capacities up to 1 exbibyte (EiB) and files
with sizes up to 16 tebibytes (TiB).
Common filesystem interface
The presence of diverse filesystems and storage partitions results in each filesystem
maintaining its own tree of files and data structures that are distinct from others.
Upon mount, each filesystem would need to manage its in-memory file trees in
isolation from others, resulting in an inconsistent view of the file tree for system
users and applications. This complicates kernel support for various file
operations such as open, read, write, copy, and move. As a solution, the Linux
kernel (like many other Unix systems) engages an abstraction layer called the
virtual file system (VFS) that hides all filesystem implementations behind a
common interface.
The VFS layer builds a common file tree called rootfs, under which all
filesystems can enumerate their directories and files. This enables all
filesystem-specific subtrees with distinct on-disk representations to be unified and
presented as a single filesystem. System users and applications have a consistent,
homogeneous view of the file tree, resulting in flexibility for the kernel to define
a simplified set of common system calls that applications can engage for file I/O,
regardless of underlying filesystems and their representations. This model
ensures simplicity in application design due to limited and flexible APIs and
enables seamless copy or movement of files from one disk partition or filesystem
tree to another, irrespective of underlying dissimilarities.
The following diagram depicts the virtual filesystem:
VFS defines two sets of functions: first, a set of generic filesystem-independent
routines that serve as common entry functions for all file access and
manipulation operations, and second, a set of abstract operation interfaces that
are filesystem specific. Each filesystem defines its operations (as per its notion
of files and directories) and maps them to the abstract interface provided by
the virtual filesystem; this enables VFS to handle file I/O requests by
dynamically switching into the underlying filesystem-specific functions.
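The two sets of functions can be illustrated with a table of function pointers. The following is a userspace sketch with hypothetical names (toy_file, toy_file_ops, memfs_read, nullfs_read, toy_vfs_read); it mirrors the dispatch pattern only and is not the kernel's struct file or struct file_operations.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct toy_file;

/* Abstract operation interface that each toy filesystem fills in. */
struct toy_file_ops {
    long (*read)(struct toy_file *f, char *buf, size_t len);
};

/* Open-file object: carries a pointer to its filesystem's operations. */
struct toy_file {
    const struct toy_file_ops *ops;
    const char *data;   /* backing contents, used by memfs only */
    size_t pos;         /* current file offset */
};

/* "memfs": serves bytes from an in-memory string. */
long memfs_read(struct toy_file *f, char *buf, size_t len)
{
    size_t size = strlen(f->data);
    size_t left = f->pos < size ? size - f->pos : 0;

    if (len > left)
        len = left;
    memcpy(buf, f->data + f->pos, len);
    f->pos += len;
    return (long)len;
}

/* "nullfs": every file is empty, so a read always reports end of file. */
long nullfs_read(struct toy_file *f, char *buf, size_t len)
{
    (void)f; (void)buf; (void)len;
    return 0;
}

const struct toy_file_ops memfs_ops  = { memfs_read };
const struct toy_file_ops nullfs_ops = { nullfs_read };

/* Generic, filesystem-independent entry point: validates the request
 * and switches into whichever implementation the file is bound to. */
long toy_vfs_read(struct toy_file *f, char *buf, size_t len)
{
    if (f == NULL || f->ops == NULL || f->ops->read == NULL)
        return -1;
    return f->ops->read(f, buf, len);
}
```

Callers only ever use toy_vfs_read; adding a third toy filesystem means supplying another operations table, with no change to the generic entry point or to its callers.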
VFS structures and operations
Deciphering the key objects and data structures of VFS lets us gain clarity on
how the VFS internally works with filesystems and enables the all-important
abstraction. Following are the four elemental data structures around which the entire
web of abstraction is woven:
struct super_block: contains information on a specific filesystem that
has been mounted
struct inode: represents a specific file
struct dentry: represents a directory entry
struct file: represents a file that has been opened and linked to a
process
All of these data structures are bound to appropriate abstract operation interfaces
that are defined by filesystems.
struct super_block {
    struct list_head s_list; /* Keep this first */
    dev_t s_dev; /* search index; _not_ kdev_t */
    unsigned char s_blocksize_bits;
    unsigned long s_blocksize;
    loff_t s_maxbytes; /* Max file size */
    struct file_system_type *s_type;
    const struct super_operations *s_op;
    const struct dquot_operations *dq_op;
    const struct quotactl_ops *s_qcop;
    const struct export_operations *s_export_op;
    unsigned long s_flags;
    unsigned long s_iflags; /* internal SB_I_* flags */
    unsigned long s_magic;
    struct dentry *s_root;
    struct rw_semaphore s_umount;
    int s_count;
    atomic_t s_active;
#ifdef CONFIG_SECURITY
    void *s_security;
#endif
    const struct xattr_handler **s_xattr;
    const struct fscrypt_operations *s_cop;
    struct hlist_bl_head s_anon;
    struct list_head s_mounts; /* list of mounts; _not_ for fs use */
    struct block_device *s_bdev;
    struct backing_dev_info *s_bdi;
    struct mtd_info *s_mtd;
    struct hlist_node s_instances;
    unsigned int s_quota_types; /* Bitmask of supported quota types */
    struct quota_info s_dquot; /* Diskquota specific options */
    struct sb_writers s_writers;
    char s_id[32]; /* Informational name */
    u8 s_uuid[16]; /* UUID */
    void *s_fs_info; /* Filesystem private info */
    unsigned int s_max_links;
    fmode_t s_mode;

    /* Granularity of c/m/atime in ns.
       Cannot be worse than a second */
    u32 s_time_gran;

    struct mutex s_vfs_rename_mutex; /* Kludge */

    /*
     * Filesystem subtype. If non-empty the filesystem type field
     * in /proc/mounts will be "type.subtype"
     */
    char *s_subtype;

    /*
     * Saved mount options for lazy filesystems using
     * generic_show_options()
     */
    char __rcu *s_options;
    const struct dentry_operations *s_d_op; /* default op for dentries */
    /*
     * Saved pool identifier for cleancache (-1 means none)
     */
    int cleancache_poolid;

    struct shrinker s_shrink; /* per-sb shrinker handle */

    /* Number of inodes with nlink == 0 but still referenced */
    atomic_long_t s_remove_count;

    /* Being remounted read-only */
    int s_readonly_remount;

    /* AIO completions deferred from interrupt context */
    struct workqueue_struct *s_dio_done_wq;
    struct hlist_head s_pins;

    /*
     * Owning user namespace and default context in which to
     * interpret filesystem uids, gids, quotas, device nodes,
     * xattrs and security labels.
     */
    struct user_namespace *s_user_ns;

    struct list_lru s_dentry_lru ____cacheline_aligned_in_smp;
    struct list_lru s_inode_lru ____cacheline_aligned_in_smp;
    struct rcu_head rcu;
    struct work_struct destroy_work;

    struct mutex s_sync_lock; /* sync serialisation lock */

    /*
     * Indicates how deep in a filesystem stack this SB is
     */
    int s_stack_depth;

    /* s_inode_list_lock protects s_inodes */
    spinlock_t s_inode_list_lock ____cacheline_aligned_in_smp;
    struct list_head s_inodes; /* all inodes */

    spinlock_t s_inode_wblist_lock;
    struct list_head s_inodes_wb; /* writeback inodes */
};
struct super_operations {
    struct inode *(*alloc_inode)(struct super_block *sb);
    void (*destroy_inode)(struct inode *);

    void (*dirty_inode)(struct inode *, int flags);
    int (*write_inode)(struct inode *, struct writeback_control *wbc);
    int (*drop_inode)(struct inode *);
    void (*evict_inode)(struct inode *);
    void (*put_super)(struct super_block *);
    int (*sync_fs)(struct super_block *sb, int wait);
    int (*freeze_super)(struct super_block *);
    int (*freeze_fs)(struct super_block *);
    int (*thaw_super)(struct super_block *);
    int (*unfreeze_fs)(struct super_block *);
    int (*statfs)(struct dentry *, struct kstatfs *);
    int (*remount_fs)(struct super_block *, int *, char *);
    void (*umount_begin)(struct super_block *);

    int (*show_options)(struct seq_file *, struct dentry *);
    int (*show_devname)(struct seq_file *, struct dentry *);
    int (*show_path)(struct seq_file *, struct dentry *);
    int (*show_stats)(struct seq_file *, struct dentry *);
#ifdef CONFIG_QUOTA
    ssize_t (*quota_read)(struct super_block *, int, char *, size_t, loff_t);
    ssize_t (*quota_write)(struct super_block *, int, const char *, size_t, loff_t);
    struct dquot **(*get_dquots)(struct inode *);
#endif
    int (*bdev_try_to_free_page)(struct super_block *, struct page *, gfp_t);
    long (*nr_cached_objects)(struct super_block *,
                              struct shrink_control *);
    long (*free_cached_objects)(struct super_block *,
                                struct shrink_control *);
};
All elements in this structure point to functions that operate on the
superblock object. All these operations are called only from process
context and without any locks being held, unless specified otherwise. Let's
look at a few important ones here:
alloc_inode: This method is used to create and allocate space for a
new inode object and initialize it under the superblock.
destroy_inode: This destroys the given inode object and frees the
resources allocated for the inode. This is only used if
alloc_inode was defined.
dirty_inode: This is called by the VFS to mark an inode as dirty
(when the inode is modified).
write_inode: VFS invokes this method when it needs to write an
inode onto the disk. The second argument points to struct
writeback_control, a structure that tells the writeback code what
to do.
put_super: This is invoked when VFS needs to free the
superblock.
sync_fs: This is invoked to synchronize filesystem data with that
of the underlying block device.
statfs: Invoked to get filesystem statistics for the VFS.
remount_fs: Invoked when the filesystem needs to be remounted.
umount_begin: Invoked when the VFS is unmounting a
filesystem.
show_options: Invoked by VFS to show mount options.
quota_read: Invoked by VFS to read from the filesystem quota
file.
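The dispatch pattern described above, where a generic layer reaches filesystem code only through a table of function pointers, can be modeled in user space. The following is a minimal sketch and not kernel code: every toy_* and ramtoy_* name is hypothetical, and the two operations are simplified stand-ins for sync_fs and statfs.

```c
#include <assert.h>
#include <stddef.h>

struct toy_super_block;

/* Hypothetical, cut-down analogue of struct super_operations. */
struct toy_super_operations {
    int (*sync_fs)(struct toy_super_block *sb, int wait);
    int (*statfs)(struct toy_super_block *sb, unsigned long *free_blocks);
};

/* Hypothetical, cut-down analogue of struct super_block. */
struct toy_super_block {
    unsigned long free_blocks;
    int dirty;
    const struct toy_super_operations *s_op;
};

/* "Filesystem" side: concrete implementations of the abstract operations. */
static int ramtoy_sync_fs(struct toy_super_block *sb, int wait)
{
    (void)wait;
    sb->dirty = 0;          /* pretend everything was written back */
    return 0;
}

static int ramtoy_statfs(struct toy_super_block *sb, unsigned long *free_blocks)
{
    *free_blocks = sb->free_blocks;
    return 0;
}

static const struct toy_super_operations ramtoy_sops = {
    .sync_fs = ramtoy_sync_fs,
    .statfs  = ramtoy_statfs,
};

/* Binds the operation table to a superblock, as a mount routine would. */
void toy_mount_ramtoy(struct toy_super_block *sb, unsigned long free_blocks)
{
    assert(sb != NULL);
    sb->free_blocks = free_blocks;
    sb->dirty = 1;
    sb->s_op = &ramtoy_sops;
}

/* Generic layer: knows only the table, never the concrete filesystem. */
int toy_vfs_sync(struct toy_super_block *sb)
{
    assert(sb != NULL && sb->s_op != NULL);
    if (sb->s_op->sync_fs)          /* operations are optional */
        return sb->s_op->sync_fs(sb, 1);
    return 0;
}

int toy_vfs_statfs(struct toy_super_block *sb, unsigned long *free_blocks)
{
    assert(sb != NULL && sb->s_op != NULL && free_blocks != NULL);
    if (!sb->s_op->statfs)
        return -1;
    return sb->s_op->statfs(sb, free_blocks);
}
```

The generic functions check each pointer before calling it, which mirrors the idea that a filesystem implements only the operations that make sense for it.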
struct inode
Each instance of struct inode represents a file in rootfs. VFS defines this structure
as an abstraction for filesystem-specific inodes. Irrespective of the type of inode
structure and its representation on disk, each filesystem needs to enumerate its
files as struct inode into rootfs for a common file view. This structure is defined in
<linux/fs.h>:
struct inode {
    umode_t i_mode;
    unsigned short i_opflags;
    kuid_t i_uid;
    kgid_t i_gid;
    unsigned int i_flags;
#ifdef CONFIG_FS_POSIX_ACL
    struct posix_acl *i_acl;
    struct posix_acl *i_default_acl;
#endif
    const struct inode_operations *i_op;
    struct super_block *i_sb;
    struct address_space *i_mapping;
#ifdef CONFIG_SECURITY
    void *i_security;
#endif
    /* Stat data, not accessed from path walking */
    unsigned long i_ino;
    /*
     * Filesystems may only read i_nlink directly. They shall use the
     * following functions for modification:
     *
     * (set|clear|inc|drop)_nlink
     * inode_(inc|dec)_link_count
     */
    union {
        const unsigned int i_nlink;
        unsigned int __i_nlink;
    };
    dev_t i_rdev;
    loff_t i_size;
    struct timespec i_atime;
    struct timespec i_mtime;
    struct timespec i_ctime;
    spinlock_t i_lock; /* i_blocks, i_bytes, maybe i_size */
    unsigned short i_bytes;
    unsigned int i_blkbits;
    blkcnt_t i_blocks;
#ifdef __NEED_I_SIZE_ORDERED
    seqcount_t i_size_seqcount;
#endif
    /* Misc */
    unsigned long i_state;
    struct rw_semaphore i_rwsem;
    unsigned long dirtied_when; /* jiffies of first dirtying */
    unsigned long dirtied_time_when;
    struct hlist_node i_hash;
    struct list_head i_io_list; /* backing dev IO list */
#ifdef CONFIG_CGROUP_WRITEBACK
    struct bdi_writeback *i_wb; /* the associated cgroup wb */
    /* foreign inode detection, see wbc_detach_inode() */
    int i_wb_frn_winner;
    u16 i_wb_frn_avg_time;
    u16 i_wb_frn_history;
#endif
    struct list_head i_lru; /* inode LRU list */
    struct list_head i_sb_list;
    struct list_head i_wb_list; /* backing dev writeback list */
    union {
        struct hlist_head i_dentry;
        struct rcu_head i_rcu;
    };
    u64 i_version;
    atomic_t i_count;
    atomic_t i_dio_count;
    atomic_t i_writecount;
#ifdef CONFIG_IMA
    atomic_t i_readcount; /* struct files open RO */
#endif
    /* former ->i_op->default_file_ops */
    const struct file_operations *i_fop;
    struct file_lock_context *i_flctx;
    struct address_space i_data;
    struct list_head i_devices;
    union {
        struct pipe_inode_info *i_pipe;
        struct block_device *i_bdev;
        struct cdev *i_cdev;
        char *i_link;
        unsigned i_dir_seq;
    };
    __u32 i_generation;
#ifdef CONFIG_FSNOTIFY
    __u32 i_fsnotify_mask; /* all events this inode cares about */
    struct hlist_head i_fsnotify_marks;
#endif
#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
    struct fscrypt_info *i_crypt_info;
#endif
    void *i_private; /* fs or device private pointer */
};
Note that not all fields are mandatory or applicable to all filesystems; each
filesystem is free to initialize the fields that are relevant to its definition of an
inode. Each inode is bound to two important groups of operations defined by the
underlying filesystem. The first is a set of operations to manage inode data; these
are represented through an instance of type struct inode_operations that is referred
to by the i_op pointer of the inode. The second is a group of operations for accessing
and manipulating the underlying file data that the inode represents; these operations
are encapsulated in an instance of type struct file_operations and bound to the
i_fop pointer of the inode instance.
In other words, each inode is bound to metadata operations represented by an
instance of type struct inode_operations, and file data operations represented by an
instance of type struct file_operations. However, user-mode applications access
file data operations from a valid file object created to represent an open file for
the caller process (we will discuss the file object in more detail in the next section):
struct inode_operations {
    struct dentry *(*lookup)(struct inode *, struct dentry *, unsigned int);
    const char *(*get_link)(struct dentry *, struct inode *, struct delayed_call *);
    int (*permission)(struct inode *, int);
    struct posix_acl *(*get_acl)(struct inode *, int);
    int (*readlink)(struct dentry *, char __user *, int);
    int (*create)(struct inode *, struct dentry *, umode_t, bool);
    int (*link)(struct dentry *, struct inode *, struct dentry *);
    int (*unlink)(struct inode *, struct dentry *);
    int (*symlink)(struct inode *, struct dentry *, const char *);
    int (*mkdir)(struct inode *, struct dentry *, umode_t);
    int (*rmdir)(struct inode *, struct dentry *);
    int (*mknod)(struct inode *, struct dentry *, umode_t, dev_t);
    int (*rename)(struct inode *, struct dentry *,
                  struct inode *, struct dentry *, unsigned int);
    int (*setattr)(struct dentry *, struct iattr *);
    int (*getattr)(struct vfsmount *mnt, struct dentry *, struct kstat *);
    ssize_t (*listxattr)(struct dentry *, char *, size_t);
    int (*fiemap)(struct inode *, struct fiemap_extent_info *, u64 start,
                  u64 len);
    int (*update_time)(struct inode *, struct timespec *, int);
    int (*atomic_open)(struct inode *, struct dentry *,
                       struct file *, unsigned open_flag,
                       umode_t create_mode, int *opened);
    int (*tmpfile)(struct inode *, struct dentry *, umode_t);
    int (*set_acl)(struct inode *, struct posix_acl *, int);
} ____cacheline_aligned;
Following is a brief description of a few important operations:
lookup: Used to locate the inode instance of the file specified; this operation
returns a dentry instance.
create: This routine is invoked by VFS to construct an inode object for the
dentry specified as an argument.
link: Used to support hard links. Called by the link(2) system call.
unlink: Used to support deleting inodes. Called by the unlink(2) system call.
mkdir: Used to support creation of subdirectories. Called by the mkdir(2)
system call.
mknod: Invoked by the mknod(2) system call to create a device inode
(character or block), a named pipe, or a socket.
listxattr: Invoked by the VFS to list all extended attributes of a file.
update_time: Invoked by the VFS to update a specific time or the i_version of
the inode.
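The effect of the link and unlink operations is visible from user space as a change in the link count of an inode. The following is a minimal sketch that uses link(2), unlink(2), and stat(2) on a scratch file; the helper names link_count and demo_hard_links and the /tmp location are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns the hard-link count of path, or -1 on error. */
long link_count(const char *path)
{
    struct stat st;

    assert(path != NULL);
    if (stat(path, &st) != 0)
        return -1;
    return (long)st.st_nlink;
}

/*
 * Creates a scratch file, adds a second name with link(2), removes that
 * name with unlink(2), and records the link count after each step.
 * Returns 0 on success, -1 on failure.
 */
int demo_hard_links(long counts[3])
{
    char dir[] = "/tmp/nlink-demo-XXXXXX";
    char a[64], b[64];
    FILE *fp;
    int rc = -1;

    assert(counts != NULL);
    if (!mkdtemp(dir))
        return -1;
    snprintf(a, sizeof(a), "%s/a", dir);
    snprintf(b, sizeof(b), "%s/b", dir);

    fp = fopen(a, "w");                 /* new inode, one name */
    if (!fp)
        goto out;
    fclose(fp);
    counts[0] = link_count(a);

    if (link(a, b) != 0)                /* second name, same inode */
        goto out;
    counts[1] = link_count(a);

    if (unlink(b) != 0)                 /* drop the second name */
        goto out;
    counts[2] = link_count(a);
    rc = 0;
out:
    unlink(b);                          /* harmless if b is already gone */
    unlink(a);
    rmdir(dir);
    return rc;
}
```

On a filesystem that supports hard links, the three recorded counts are expected to be 1, 2, and 1.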
The following is the VFS-defined struct file_operations, which encapsulates
filesystem-defined operations on the underlying file data. Since this is declared
to serve as a common interface for all filesystems, it contains function pointer
interfaces suitable to support operations on various types of filesystems with
distinct definitions of file data. Underlying filesystems are free to choose
appropriate interfaces and leave the rest, depending on their notion of file and
file data:
struct file_operations {
    struct module *owner;
    loff_t (*llseek)(struct file *, loff_t, int);
    ssize_t (*read)(struct file *, char __user *, size_t, loff_t *);
    ssize_t (*write)(struct file *, const char __user *, size_t, loff_t *);
    ssize_t (*read_iter)(struct kiocb *, struct iov_iter *);
    ssize_t (*write_iter)(struct kiocb *, struct iov_iter *);
    int (*iterate)(struct file *, struct dir_context *);
    int (*iterate_shared)(struct file *, struct dir_context *);
    unsigned int (*poll)(struct file *, struct poll_table_struct *);
    long (*unlocked_ioctl)(struct file *, unsigned int, unsigned long);
    long (*compat_ioctl)(struct file *, unsigned int, unsigned long);
    int (*mmap)(struct file *, struct vm_area_struct *);
    int (*open)(struct inode *, struct file *);
    int (*flush)(struct file *, fl_owner_t id);
    int (*release)(struct inode *, struct file *);
    int (*fsync)(struct file *, loff_t, loff_t, int datasync);
    int (*fasync)(int, struct file *, int);
    int (*lock)(struct file *, int, struct file_lock *);
    ssize_t (*sendpage)(struct file *, struct page *, int, size_t, loff_t *, int);
    unsigned long (*get_unmapped_area)(struct file *, unsigned long, unsigned
                                       long, unsigned long, unsigned long);
    int (*check_flags)(int);
    int (*flock)(struct file *, int, struct file_lock *);
    ssize_t (*splice_write)(struct pipe_inode_info *, struct file *, loff_t *, size_t,
                            unsigned int);
    ssize_t (*splice_read)(struct file *, loff_t *, struct pipe_inode_info *, size_t,
                           unsigned int);
    int (*setlease)(struct file *, long, struct file_lock **, void **);
    long (*fallocate)(struct file *file, int mode, loff_t offset,
                      loff_t len);
    void (*show_fdinfo)(struct seq_file *m, struct file *f);
#ifndef CONFIG_MMU
    unsigned (*mmap_capabilities)(struct file *);
#endif
    ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
                               loff_t, size_t, unsigned int);
    int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
                            u64);
    ssize_t (*dedupe_file_range)(struct file *, u64, u64, struct file *,
                                 u64);
};
Following is a brief description of a few important operations:
llseek: Invoked when the VFS needs to move the file position index.
read: Invoked by read(2) and other related system calls.
write: Invoked by write(2) and other related system calls.
iterate: Invoked when VFS needs to read directory contents.
poll: This is invoked by the VFS when a process needs to check for activity
on the file. Called by the select(2) and poll(2) system calls.
unlocked_ioctl: The operation assigned to this pointer is invoked when the
user-mode process calls the ioctl(2) system call on the file descriptor. This
function is used to support special operations. Device drivers use this
interface to support configuration operations on the target device.
compat_ioctl: Similar to unlocked_ioctl, except that it is invoked when a
32-bit process calls ioctl(2) on a 64-bit kernel, and is used to convert the
arguments passed by that process.
mmap: The routine assigned to this pointer is invoked when the user-mode
process calls the mmap(2) system call. Functionality supported by this
function is underlying filesystem dependent. For regular persistent files,
this function is implemented to map the caller-specified data region of the
file into the virtual address space of the caller process. For device files that
support mmap, this routine maps the underlying device address space into the
caller's virtual address space.
open: The function assigned to this interface is invoked by VFS when the
user-mode process initiates the open(2) system call to create a file descriptor.
flush: Invoked by the close(2) system call to flush a file.
release: A function assigned to this interface is invoked by VFS when a
user-mode process executes the close(2) system call to destroy a file
descriptor and the last reference to the open file is dropped.
fasync: Invoked by the fcntl(2) system call when asynchronous mode is
enabled for a file.
splice_write: Invoked by the VFS to splice data from a pipe to a file.
setlease: Invoked by the VFS to set or release a file lock lease.
fallocate: Invoked by the VFS to preallocate blocks for a file.
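From user space, these operations are reached through the ordinary file API. The following is a minimal sketch that writes to a scratch file, rewinds it, and reads the data back; the comments indicate which file_operations entry each call is expected to reach, although the exact entry (for example read versus read_iter) depends on the filesystem. The function name file_round_trip and the /tmp location are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Writes msg to a scratch file, rewinds, and reads it back into out.
 * Returns the number of bytes read, or -1 on error.
 */
ssize_t file_round_trip(const char *msg, char *out, size_t outsz)
{
    char path[] = "/tmp/fops-demo-XXXXXX";
    size_t len;
    ssize_t n = -1;
    int fd;

    assert(msg != NULL && out != NULL && outsz > 0);
    len = strlen(msg);

    fd = mkstemp(path);                 /* reaches open */
    if (fd < 0)
        return -1;
    unlink(path);                       /* file lives until fd is closed */

    if (write(fd, msg, len) == (ssize_t)len &&  /* reaches write or write_iter */
        lseek(fd, 0, SEEK_SET) == 0) {          /* reaches llseek */
        n = read(fd, out, outsz - 1);           /* reaches read or read_iter */
        if (n >= 0)
            out[n] = '\0';
    }
    close(fd);                          /* reaches flush, then release */
    return n;
}
```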
struct dentry
In our earlier discussion, we gained an understanding of how a typical disk
filesystem represents each directory through an inode structure, and how a
directory block on disk represents information of files under that directory.
When user-mode applications initiate file access operations such as open() with a
complete path such as /root/test/abc, the VFS will need to perform directory
lookup operations to decode and validate each component specified in the path.
For efficient lookup and translation of components in a file path, VFS
enumerates a special data structure, called dentry. A dentry object contains a
string name of the file or directory, a pointer to its inode, and a pointer to the parent
dentry. An instance of dentry is generated for each component in the file lookup
path; for instance, in the case of /root/test/abc, a dentry is enumerated for root,
another for test, and finally for file abc.
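The per-component nature of a path walk can be illustrated with a small user-space routine that breaks a path into the names for which a lookup would need a dentry. This is only a sketch of the splitting step, not of the kernel's path-walk code; split_path and MAX_COMPONENT are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_COMPONENT 256

/*
 * Splits path into its components and stores them in names[].
 * Returns the number of components, or -1 if a component is too long
 * or there are more than max components.
 */
int split_path(const char *path, char names[][MAX_COMPONENT], int max)
{
    int count = 0;

    assert(path != NULL && names != NULL && max > 0);
    while (*path) {
        size_t len;

        while (*path == '/')            /* skip one or more separators */
            path++;
        if (!*path)
            break;
        len = strcspn(path, "/");       /* length of this component */
        if (len >= MAX_COMPONENT || count == max)
            return -1;
        memcpy(names[count], path, len);
        names[count][len] = '\0';
        count++;
        path += len;
    }
    return count;
}
```

For /root/test/abc the routine yields three names (root, test, and abc), one for each dentry mentioned in the text.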
struct dentry is defined in the kernel header <linux/dcache.h>:
struct dentry {
    /* RCU lookup touched fields */
    unsigned int d_flags; /* protected by d_lock */
    seqcount_t d_seq; /* per dentry seqlock */
    struct hlist_bl_node d_hash; /* lookup hash list */
    struct dentry *d_parent; /* parent directory */
    struct qstr d_name;
    struct inode *d_inode; /* Where the name belongs to - NULL is negative */
    unsigned char d_iname[DNAME_INLINE_LEN]; /* small names */
    /* Ref lookup also touches following */
    struct lockref d_lockref; /* per-dentry lock and refcount */
    const struct dentry_operations *d_op;
    struct super_block *d_sb; /* The root of the dentry tree */
    unsigned long d_time; /* used by d_revalidate */
    void *d_fsdata; /* fs-specific data */
    union {
        struct list_head d_lru; /* LRU list */
        wait_queue_head_t *d_wait; /* in-lookup ones only */
    };
    struct list_head d_child; /* child of parent list */
    struct list_head d_subdirs; /* our children */
    /*
     * d_alias and d_rcu can share memory
     */
    union {
        struct hlist_node d_alias; /* inode alias list */
        struct hlist_bl_node d_in_lookup_hash;
        struct rcu_head d_rcu;
    } d_u;
};
d_parent is a pointer to the parent dentry instance.
d_name holds the name of the file.
d_inode is a pointer to the inode instance of the file.
d_flags contains several flags defined in <linux/dcache.h>.
d_op points to the structure containing function pointers to various
operations for the dentry object.
Let's now look at struct dentry_operations, which describes how a filesystem can
overload the standard dentry operations:
struct dentry_operations {
    int (*d_revalidate)(struct dentry *, unsigned int);
    int (*d_weak_revalidate)(struct dentry *, unsigned int);
    int (*d_hash)(const struct dentry *, struct qstr *);
    int (*d_compare)(const struct dentry *,
                     unsigned int, const char *, const struct qstr *);
    int (*d_delete)(const struct dentry *);
    int (*d_init)(struct dentry *);
    void (*d_release)(struct dentry *);
    void (*d_prune)(struct dentry *);
    void (*d_iput)(struct dentry *, struct inode *);
    char *(*d_dname)(struct dentry *, char *, int);
    struct vfsmount *(*d_automount)(struct path *);
    int (*d_manage)(const struct path *, bool);
    struct dentry *(*d_real)(struct dentry *, const struct inode *,
                             unsigned int);
} ____cacheline_aligned;
Following is a brief description of a few important dentry operations:
d_revalidate: Invoked when VFS needs to revalidate a dentry. This is called
whenever a name lookup returns a dentry in the dcache.
d_weak_revalidate: Invoked when VFS needs to revalidate a jumped dentry.
This is invoked if a path-walk ends at a dentry that wasn't found on a
lookup on the parent directory.
d_hash: Invoked when VFS adds a dentry to the hash table.
d_compare: Invoked to compare a dentry name with a given name.
d_delete: Invoked when the last reference to a dentry is removed.
d_init: Invoked when a dentry is allocated.
d_release: Invoked when a dentry is deallocated.
d_iput: Invoked when an inode is released from the dentry.
d_dname: Invoked when the pathname of the dentry must be generated. Handy
for special filesystems to delay pathname generation (until the path is
needed).
struct file {
    union {
        struct llist_node fu_llist;
        struct rcu_head fu_rcuhead;
    } f_u;
    struct path f_path;
    struct inode *f_inode; /* cached value */
    const struct file_operations *f_op;
    /*
     * Protects f_ep_links, f_flags.
     * Must not be taken from IRQ context.
     */
    spinlock_t f_lock;
    atomic_long_t f_count;
    unsigned int f_flags;
    fmode_t f_mode;
    struct mutex f_pos_lock;
    loff_t f_pos;
    struct fown_struct f_owner;
    const struct cred *f_cred;
    struct file_ra_state f_ra;
    u64 f_version;
#ifdef CONFIG_SECURITY
    void *f_security;
#endif
    /* needed for tty driver, and maybe others */
    void *private_data;
#ifdef CONFIG_EPOLL
    /* Used by fs/eventpoll.c to link all the hooks to this file */
    struct list_head f_ep_links;
    struct list_head f_tfile_llink;
#endif /* #ifdef CONFIG_EPOLL */
    struct address_space *f_mapping;
} __attribute__((aligned(4))); /* lest something weird decides that 2 is OK */
The f_inode pointer refers to the inode instance of the file. When a
file object is constructed by VFS, the f_op pointer is initialized with
the address of the struct file_operations associated with the file's
inode, as we discussed earlier.
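The distinction between a file object and an inode can be observed from user space: opening the same file twice is expected to yield two file objects, each with its own file position, that refer to one inode. The following is a minimal sketch under that assumption; two_opens and the /tmp location are hypothetical choices for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Opens one scratch file twice, reads 4 bytes through the first
 * descriptor only, and reports:
 *   same_inode  1 if both descriptors refer to the same inode
 *   off1, off2  the file positions of the two descriptors
 * Returns 0 on success, -1 on error.
 */
int two_opens(int *same_inode, off_t *off1, off_t *off2)
{
    char path[] = "/tmp/file-demo-XXXXXX";
    struct stat s1, s2;
    char buf[4];
    int fd1, fd2, rc = -1;

    assert(same_inode != NULL && off1 != NULL && off2 != NULL);
    fd1 = mkstemp(path);                /* first open: read-write */
    if (fd1 < 0)
        return -1;
    fd2 = open(path, O_RDONLY);         /* second, independent open */
    unlink(path);
    if (fd2 < 0) {
        close(fd1);
        return -1;
    }

    if (write(fd1, "12345678", 8) == 8 &&
        lseek(fd1, 0, SEEK_SET) == 0 &&
        read(fd1, buf, sizeof(buf)) == 4 &&
        fstat(fd1, &s1) == 0 && fstat(fd2, &s2) == 0) {
        *same_inode = (s1.st_dev == s2.st_dev && s1.st_ino == s2.st_ino);
        *off1 = lseek(fd1, 0, SEEK_CUR);    /* advanced by the read */
        *off2 = lseek(fd2, 0, SEEK_CUR);    /* untouched */
        rc = 0;
    }
    close(fd1);
    close(fd2);
    return rc;
}
```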
Special filesystems
Unlike regular filesystems, which are designed to manage persistent file data
backed onto a storage device, the kernel implements various special filesystems
that manage a specific class of kernel in-core data structures. Since these
filesystems do not deal with persistent data, they do not consume disk blocks,
and the entire filesystem structure is maintained in-core. Presence of such
filesystems enables simplified application development, debugging, and easier
error detection. There are many filesystems in this category, each deliberately
designed and implemented for a specific purpose. Following is a brief description
of a few important ones.
Procfs
Procfs is a special filesystem that enumerates kernel data structures as files. This
filesystem serves as a debugging resource for kernel programmers, since it
allows users to view the state of data structures through the virtual file interface.
Procfs is mounted to the /proc directory (mount point) of rootfs.
Data in procfs files is not persistent, and is always constructed on the run; each
file is an interface through which users can trigger associated operations. For
instance, a read operation on a proc file invokes the associated read callback
function bound to the file entry, and that function is implemented to populate the
user buffer with appropriate data.
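Because procfs content is generated at the time of each read, two reads of the same file can return different data. The following is a minimal sketch that parses /proc/uptime, which is assumed here to contain two decimal numbers (seconds since boot and seconds spent idle); the function name read_uptime is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Reads the uptime and idle time (in seconds) from /proc/uptime.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
int read_uptime(double *up, double *idle)
{
    FILE *fp;
    int n;

    assert(up != NULL && idle != NULL);
    fp = fopen("/proc/uptime", "r");    /* no disk blocks are involved */
    if (!fp)
        return -1;
    n = fscanf(fp, "%lf %lf", up, idle);    /* content built on this read */
    fclose(fp);
    return n == 2 ? 0 : -1;
}
```

Calling read_uptime twice with a pause in between should report a larger uptime the second time, since each read triggers the callback afresh.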
The number of files enumerated depends on the configuration and architecture
for which the kernel was built. Following is a list of a few important files with
useful data enumerated under /proc:
Filename            Description
/proc/cpuinfo       Provides low-level CPU details such as vendor, model, clock speed, cache size, number of siblings, cores, CPU flags, and bogomips.
/proc/meminfo       Provides a summarized view of physical memory state.
/proc/ioports       Provides details on current usage of port I/O address space supported by the x86 class of machines. This file is not present on other architectures.
/proc/iomem         Shows a detailed layout describing current usage of memory address space.
/proc/interrupts    Shows a view of the IRQ descriptor table that contains details of IRQ lines and interrupt handlers bound to each.
/proc/slabinfo      Shows a detailed listing of slab caches and their current state.
/proc/buddyinfo     Shows the current state of buddy lists managed by the buddy system.
/proc/vmstat        Shows virtual memory management statistics.
/proc/zoneinfo      Shows per-node memory zone statistics.
/proc/cmdline       Shows boot arguments passed to the kernel.
/proc/timer_list    Shows a list of active pending timers, with details of clock source.
/proc/timer_stats   Provides detailed statistics on active timers, used for tracking timer usage and debugging.
/proc/filesystems   Presents a list of filesystem types currently registered with the kernel.
/proc/mounts        Shows currently mounted devices with their mount points.
/proc/partitions    Presents details of current storage partitions detected with associated /dev file enumerations.
/proc/swaps         Lists out active swap partitions with status details.
/proc/modules       Lists out names and status of kernel modules currently deployed.
/proc/uptime        Shows the length of time the kernel has been running since boot and the time spent in idle mode.
/proc/kmsg          Shows contents of the kernel's message log buffer.
/proc/kallsyms      Presents the kernel symbol table.
/proc/devices       Presents a list of registered block and character devices with their major numbers.
/proc/misc          Presents a list of devices registered through the misc interface with their misc identifiers.
/proc/stat          Presents system statistics.
/proc/net           Directory that contains various network stack-related pseudo files.
/proc/sysvipc       Subdirectory containing pseudo files that show the status of System V IPC objects: message queues, semaphores, and shared memory.
/proc also lists out a number of subdirectories that provide a detailed view of
elements in the process PCB or task structure. These folders are named by the PID
of the process that they represent. Following is a list of important files that
present process-related information:
Filename            Description
/proc/pid/cmdline   Command line of the process.
/proc/pid/exe       A symbolic link to the executable file.
/proc/pid/environ   Lists out environment variables accessible to the process.
/proc/pid/cwd       A symbolic link to the current working directory of the process.
/proc/pid/mem       A binary image that shows the virtual memory of the process.
/proc/pid/maps      Lists out virtual memory mappings for the process.
/proc/pid/fdinfo    A directory that lists out open file descriptors' current status and flags.
/proc/pid/fd        Directory that contains symlinks to open file descriptors.
/proc/pid/status    Lists out current status of the process, including its memory usage.
/proc/pid/sched     Lists out scheduling statistics.
/proc/pid/cpuset    Shows the cpuset to which this process is confined.
/proc/pid/cgroup    Shows cgroup details for the process.
/proc/pid/stack     Shows a backtrace of the process-owned kernel stack.
/proc/pid/smaps     Shows memory consumed for each mapping into its address space.
/proc/pid/pagemap   Shows the physical mapping status for each virtual page of the process.
/proc/pid/syscall   Exposes the system call number and arguments for the system call currently being executed by the process.
/proc/pid/task      Directory containing child process/thread details.
These listings were drawn up to familiarize you with proc files and
their use. You are advised to visit the manual page of procfs for a
detailed description of each of these files.
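The symbolic links in a process directory can be resolved with readlink(2). The following is a minimal sketch that uses /proc/self, the procfs link that refers to the directory of the calling process, so that no PID has to be assumed; the helper name proc_self_link is hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Resolves the symbolic link /proc/self/<name> (for example "cwd" or
 * "exe") into buf. Returns the length of the target, or -1 on error.
 */
ssize_t proc_self_link(const char *name, char *buf, size_t bufsz)
{
    char path[64];
    ssize_t n;

    assert(name != NULL && buf != NULL && bufsz > 1);
    if (snprintf(path, sizeof(path), "/proc/self/%s", name)
            >= (int)sizeof(path))
        return -1;                      /* name too long for path[] */
    n = readlink(path, buf, bufsz - 1); /* readlink does not add a NUL */
    if (n >= 0)
        buf[n] = '\0';
    return n;
}
```

Resolving cwd this way should give the same path that getcwd(3) reports for the process.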
Most of the files we listed so far are read-only; procfs also contains a branch,
/proc/sys, that holds read-write files, which are referred to as kernel parameters.
Files under /proc/sys are further classified as per the subsystems to which they
apply. Listing out all those files is out of scope.
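A kernel parameter can be read like any other text file. The following is a minimal sketch that reads a parameter given its path relative to /proc/sys; the helper name read_sysctl is hypothetical, and kernel/ostype is used as the example because it is assumed to be present and world-readable on typical Linux systems. Writing a parameter works the same way with a write, but generally requires privileges.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Reads the first line of /proc/sys/<name> into buf and strips the
 * trailing newline. Returns 0 on success, -1 on error.
 */
int read_sysctl(const char *name, char *buf, size_t bufsz)
{
    char path[256];
    FILE *fp;

    assert(name != NULL && buf != NULL && bufsz > 0);
    if (snprintf(path, sizeof(path), "/proc/sys/%s", name)
            >= (int)sizeof(path))
        return -1;                      /* name too long for path[] */
    fp = fopen(path, "r");
    if (!fp)
        return -1;
    if (!fgets(buf, (int)bufsz, fp)) {
        fclose(fp);
        return -1;
    }
    fclose(fp);
    buf[strcspn(buf, "\n")] = '\0';     /* drop the newline, if any */
    return 0;
}
```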
Sysfs
Sysfs is another pseudo filesystem that was introduced to export unified hardware
and driver information to user mode. It enumerates information about devices
and associated device drivers from the kernel's device model perspective to user
space through virtual files. Sysfs is mounted to the /sys directory (mount point)
of the rootfs. Similar to procfs, underlying drivers and kernel subsystems can be
configured for power management and other functionalities through virtual file
interfaces of sysfs. Sysfs also enables hotplug event management by Linux
distros through appropriate daemons such as udev, which is configured to listen
and respond to hotplug events.
Following is a brief description of important subdirectories of sysfs:
Devices: One of the objectives behind the introduction of sysfs is to present
a unified list of devices currently enumerated and managed by respective
driver subsystems. The devices directory contains the global device
hierarchy, which contains information for each physical and virtual device
that has been discovered by the driver subsystems and registered with the
kernel.
Bus: This directory contains a listing of subdirectories, each representing
a physical bus type that has support registered in the kernel. Each bus
type directory contains two subdirectories: devices and drivers. The devices
directory contains a listing of devices currently discovered or bound to that
bus type. Each file in the listing is a symbolic link to the device file in the
device's directory in the global device tree. The drivers directory contains
directories describing each device driver registered with the bus manager.
Each of the driver directories lists out attributes that show the current
configuration of driver parameters, which can be modified, and symbolic
links that point to the physical device directory that the driver is bound to.
Class: The class directory contains representations of device classes that are
currently registered with the kernel. A device class describes a functional
type of device. Each device class directory contains subdirectories
representing devices currently allocated and registered under this class. For
most of the class device objects, their directories contain symbolic links to
the device and driver directories in the global device hierarchy and the bus
hierarchy that are associated with that class object.
Firmware: The firmware directory contains interfaces for viewing and
manipulating platform-specific firmware that is run during power on/reset,
such as BIOS or UEFI on x86 and OpenFirmware for PPC platforms.
Modules: This directory contains subdirectories that represent each kernel
module currently deployed. Each directory is enumerated with the name of
the module it is representing. Each module directory contains information
about a module such as refcount, modparams, and its core size.
Debugfs
Unlike procfs and sysfs, which are implemented to present specific information
through the virtual file interface, debugfs is a generic in-memory filesystem that
allows kernel developers to export any arbitrary information that is deemed
useful for debugging. Debugfs provides function interfaces used to enumerate
virtual files and is generally mounted to the /sys/kernel/debug directory. Debugfs
is used by tracing mechanisms such as ftrace to present function and interrupt traces.
There are many other special filesystems such as pipefs, mqueue, and sockfs; we
shall touch upon a few of them in later chapters.
Summary
Through this chapter, we have gained a generic understanding of a typical
filesystem, its fabric and design, and what makes it an elemental part of an
operating system. This chapter also emphasizes the importance and elegance of
abstraction, using the common, layered architecture design which the kernel
comprehensively imbibes. We have also stretched our understanding of the VFS
and its common file interface that facilitates the common file API and its internal
structures. In a later chapter, we shall explore another facet of memory
management, the virtual memory manager, which deals with process virtual
address spaces and page tables.
Interprocess Communication
A complex application-programming model might include a number of
processes, each implemented to handle a specific job, which contribute to the
end functionality of the application as a whole. Depending on the objective,
design, and environment in which such applications are hosted, the processes
involved might be related (parent-child, siblings) or unrelated. Often, such
processes need various resources to communicate, share data, and synchronize
their execution to achieve desired results. These are provided by the operating
system's kernel as services called interprocess communication (IPC). We have
already discussed the usage of signals as an IPC mechanism; in this chapter, we
shall begin to explore various other resources available for process
communication and data sharing.
In this chapter we will cover the following topics:
Pipes and FIFOs as messaging resources
SysV IPC resources
POSIX IPC mechanisms
Pipes and FIFOs
Pipes form a basic unidirectional, self-synchronous means of communication
between processes. As the name suggests, they have two ends: one where a
process writes and the opposite end from where another process reads the data.
Data is read out in the order in which it was written (first in, first out). Pipes
innately result in communication synchronization due to their limited capacity: if
the writing process writes much faster than the reading process reads, the pipe's
capacity will fail to hold excess data and invariably block the writing process
until the reader reads and frees up space. Similarly, if the reader reads data faster
than the writer, it will be left with no data to read, thus being blocked until data
becomes available.
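The limited capacity of a pipe can be observed without blocking by switching the write end to non-blocking mode, in which case a write that would otherwise block fails with EAGAIN. The following is a minimal sketch under that assumption; the function name fill_pipe and the 4096-byte chunk size are arbitrary choices for illustration, and the capacity reported depends on the kernel configuration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Fills a fresh pipe through a non-blocking write end until the kernel
 * refuses more data. Sets *would_block to 1 if the refusal was EAGAIN.
 * Returns the number of bytes the pipe accepted, or -1 on error.
 */
long fill_pipe(int *would_block)
{
    char chunk[4096] = { 0 };
    long total = 0;
    int fds[2], flags;

    assert(would_block != NULL);
    *would_block = 0;
    if (pipe(fds) != 0)
        return -1;
    flags = fcntl(fds[1], F_GETFL);
    if (flags < 0 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) != 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    for (;;) {
        ssize_t n = write(fds[1], chunk, sizeof(chunk));

        if (n < 0) {
            if (errno == EAGAIN)        /* a blocking writer would sleep here */
                *would_block = 1;
            break;
        }
        total += n;
    }
    close(fds[0]);
    close(fds[1]);
    return *would_block ? total : -1;
}
```

With a blocking descriptor, the same loop would put the writer to sleep at the point where this version sees EAGAIN, until a reader drains the pipe.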
Pipes can be used as a messaging resource for both cases of communication:
between related processes and between unrelated processes. When applied
between related processes, pipes are referred to as unnamed pipes, since they
are not enumerated as files under the rootfs tree. An unnamed pipe can be
allocated through the pipe() API or its variant pipe2(), which additionally accepts flags:
int pipe2(int pipefd[2], int flags);
The API invokes a corresponding system call, which allocates appropriate data
structures and sets up pipe buffers. It maps a pair of file descriptors, one for
reading on the pipe buffer and another for writing on the pipe buffer. These
descriptors are returned to the caller. The caller process normally forks the child
process, which inherits the pipe file descriptors that can be used for messaging.
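The usual pattern of creating a pipe and then forking can be sketched as follows. The example uses pipe() rather than pipe2() to stay within POSIX; the function name pipe_from_child is hypothetical, and the message is assumed to be short enough to be written and read in one call.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Creates a pipe, forks a child that writes msg into it, and reads the
 * message in the parent. Returns the number of bytes read, -1 on error.
 */
ssize_t pipe_from_child(const char *msg, char *out, size_t outsz)
{
    int fds[2];     /* fds[0]: read end, fds[1]: write end */
    pid_t pid;
    ssize_t n;

    assert(msg != NULL && out != NULL && outsz > 1);
    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                     /* child: inherits both descriptors */
        ssize_t w;

        close(fds[0]);                  /* child only writes */
        w = write(fds[1], msg, strlen(msg));
        close(fds[1]);
        _exit(w == (ssize_t)strlen(msg) ? 0 : 1);
    }

    close(fds[1]);                      /* parent only reads */
    n = read(fds[0], out, outsz - 1);   /* blocks until data arrives */
    if (n >= 0)
        out[n] = '\0';
    close(fds[0]);
    waitpid(pid, NULL, 0);              /* reap the child */
    return n;
}
```

Each side closes the end it does not use; in particular, the parent must close its copy of the write end, otherwise a later read would never see end-of-file.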
The following code excerpt shows the pipe2 system call implementation:
SYSCALL_DEFINE2(pipe2, int __user *, fildes, int, flags)
{
    struct file *files[2];
    int fd[2];
    int error;
    error = __do_pipe_flags(fd, files, flags);
    if (!error) {
        if (unlikely(copy_to_user(fildes, fd, sizeof(fd)))) {
            fput(files[0]);
            fput(files[1]);
            put_unused_fd(fd[0]);
            put_unused_fd(fd[1]);
            error = -EFAULT;
        } else {
            fd_install(fd[0], files[0]);
            fd_install(fd[1], files[1]);
        }
    }
    return error;
}
Communication between unrelated processes requires the pipe file to be
enumerated into rootfs. Such pipes are often called named pipes, and can be
created either from the command line (mkfifo) or from a process using the mkfifo
API:
int mkfifo(const char *pathname, mode_t mode);
A named pipe is created with the name specified and with appropriate
permissions as specified by the mode argument. The mknod system call is invoked
for creating a FIFO, which internally invokes VFS routines to set up the named
pipe. Processes with access permissions can initiate operations on FIFOs through
the common VFS file APIs open, read, write, and close.
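The following is a minimal sketch of messaging over a named pipe. For the sake of a self-contained example, the writer is created with fork(), but it locates the FIFO purely by its pathname, as an unrelated process would. The function name fifo_round_trip, the /tmp location, and the FIFO name chan are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Creates a FIFO, lets a child process write msg into it by name, and
 * reads the message in the caller. Returns bytes read, -1 on error.
 */
ssize_t fifo_round_trip(const char *msg, char *out, size_t outsz)
{
    char dir[] = "/tmp/fifo-demo-XXXXXX";
    char path[64];
    ssize_t n = -1;
    pid_t pid;

    assert(msg != NULL && out != NULL && outsz > 1);
    if (!mkdtemp(dir))
        return -1;
    snprintf(path, sizeof(path), "%s/chan", dir);
    if (mkfifo(path, 0600) != 0) {      /* enumerate the pipe as a file */
        rmdir(dir);
        return -1;
    }

    pid = fork();
    if (pid == 0) {                     /* writer side */
        int wfd = open(path, O_WRONLY); /* blocks until a reader opens */
        ssize_t w = wfd < 0 ? -1 : write(wfd, msg, strlen(msg));

        if (wfd >= 0)
            close(wfd);
        _exit(w == (ssize_t)strlen(msg) ? 0 : 1);
    }
    if (pid > 0) {                      /* reader side */
        int rfd = open(path, O_RDONLY); /* blocks until the writer opens */

        if (rfd >= 0) {
            n = read(rfd, out, outsz - 1);
            if (n >= 0)
                out[n] = '\0';
            close(rfd);
        }
        waitpid(pid, NULL, 0);
    }
    unlink(path);                       /* remove the FIFO name */
    rmdir(dir);
    return n;
}
```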
static struct file_system_type pipe_fs_type = {
    .name = "pipefs",
    .mount = pipefs_mount,
    .kill_sb = kill_anon_super,
};

static int __init init_pipe_fs(void)
{
    int err = register_filesystem(&pipe_fs_type);

    if (!err) {
        pipe_mnt = kern_mount(&pipe_fs_type);
        if (IS_ERR(pipe_mnt)) {
            err = PTR_ERR(pipe_mnt);
            unregister_filesystem(&pipe_fs_type);
        }
    }
    return err;
}

fs_initcall(init_pipe_fs);

struct inode {
    umode_t i_mode;
    unsigned short i_opflags;
    kuid_t i_uid;
    kgid_t i_gid;
    unsigned int i_flags;
    ...
    union {
        struct pipe_inode_info *i_pipe;
        struct block_device *i_bdev;
        struct cdev *i_cdev;
        char *i_link;
        unsigned i_dir_seq;
    };
    ...
};

struct pipe_inode_info {
    struct mutex mutex;
    wait_queue_head_t wait;
    unsigned int nrbufs, curbuf, buffers;
    unsigned int readers;
    unsigned int writers;
    unsigned int files;
    unsigned int waiting_writers;
    unsigned int r_counter;
    unsigned int w_counter;
    struct page *tmp_page;
    struct fasync_struct *fasync_readers;
    struct fasync_struct *fasync_writers;
    struct pipe_buffer *bufs;
    struct user_struct *user;
};

struct pipe_buffer {
    struct page *page;
    unsigned int offset, len;
    const struct pipe_buf_operations *ops;
    unsigned int flags;
    unsigned long private;
};

const struct file_operations pipefifo_fops = {
    .open = fifo_open,
    .llseek = no_llseek,
    .read_iter = pipe_read,
    .write_iter = pipe_write,
    .poll = pipe_poll,
    .unlocked_ioctl = pipe_ioctl,
    .release = pipe_release,
    .fasync = pipe_fasync,
};
Message queues
Message queues are lists of message buffers through which an arbitrary number
of processes can communicate. Unlike pipes, the writer does not have to wait for
the reader to open the pipe and listen for data. Similar to a mailbox, writers can
drop a fixed-length message wrapped in a buffer into the queue, which the reader
can pick whenever it is ready. The message queue does not retain the message
packet after it is picked by the reader, which means that each message packet is
assured to be process persistent. Linux supports two distinct implementations of
message queues: classic Unix SysV message queues and contemporary POSIX
message queues.
SystemVmessagequeues
ThisistheclassicAT&Tmessagequeueimplementationsuitableformessaging
betweenanarbitrarynumberofunrelatedprocesses.Senderprocesseswrapeach
messageintoapacketcontainingmessagedataandamessagenumber.The
messagequeueimplementationdoesnotdefinethemeaningofthemessage
number,anditislefttotheapplicationdesignerstodefineappropriatemeanings
formessagenumbersandprogramreadersandwriterstointerpretthesame.This
mechanismprovidesflexibilityforprogrammerstousemessagenumbersas
messageIDsorreceiverIDs.Itenablesreaderprocessestoselectivelyread
messagesthatmatchspecificIDs.However,messageswiththesameIDare
alwaysreadinFIFOorder(firstin,firstout).
Processes can create and open a SysV message queue with:
int msgget(key_t key, int msgflg);
The key parameter is a unique constant that serves as a magic number to identify
the message queue. All programs that need to access this message queue must use
the same magic number; this number is usually hard-coded into the relevant
processes at compile time. However, applications need to ensure that the key
value is unique for each message queue, and there are alternate library
functions (such as ftok()) through which unique keys can be dynamically generated.
If the msgflg parameter includes IPC_CREAT and no queue is yet associated with
the key, a new message queue is set up. Processes that have access to the queue
can write and read messages using the msgsnd and msgrcv routines (we will
not discuss them in detail here; refer to Linux system programming manuals):
int msgsnd(int msqid, const void *msgp, size_t msgsz, int msgflg);
ssize_t msgrcv(int msqid, void *msgp, size_t msgsz, long msgtyp, int msgflg);
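The selective read described above can be sketched in user space as follows. This is a minimal illustration, not code from the kernel: struct demo_msg and msgq_selective_read() are hypothetical names, and IPC_PRIVATE is used so that no key needs to be agreed upon.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Application-defined message layout: a long type followed by the payload. */
struct demo_msg {
    long mtype;        /* message number, must be greater than zero */
    char mtext[32];    /* message data */
};

/*
 * Queue two messages with types 1 and 2, then read the type 2 message first.
 * Copies the received text into out and returns its length, or -1 on error.
 */
ssize_t msgq_selective_read(char *out, size_t outsz)
{
    struct demo_msg m;
    ssize_t n = -1;

    /* IPC_PRIVATE always creates a new queue; 0600 = owner read/write. */
    int qid = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (qid == -1)
        return -1;

    m.mtype = 1;
    strcpy(m.mtext, "first");
    if (msgsnd(qid, &m, strlen(m.mtext) + 1, 0) == -1)
        goto cleanup;

    m.mtype = 2;
    strcpy(m.mtext, "second");
    if (msgsnd(qid, &m, strlen(m.mtext) + 1, 0) == -1)
        goto cleanup;

    /* msgtyp = 2 selects the first queued message whose type is 2. */
    n = msgrcv(qid, &m, sizeof(m.mtext), 2, IPC_NOWAIT);
    if (n >= 0) {
        strncpy(out, m.mtext, outsz - 1);
        out[outsz - 1] = '\0';
    }
cleanup:
    msgctl(qid, IPC_RMID, NULL);   /* remove the queue explicitly */
    return n;
}
```

Calling msgq_selective_read() with a 32-byte buffer should hand back the text of the second message even though it was queued last, since msgrcv() filters on the message number.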
Data structures
For each message queue, the underlying SysV IPC subsystem sets up a set of data
structures. struct msg_queue is the core data structure, and one instance of it
is allocated for each message queue:
struct msg_queue {
    struct kern_ipc_perm q_perm;
    time_t q_stime;                 /* last msgsnd time */
    time_t q_rtime;                 /* last msgrcv time */
    time_t q_ctime;                 /* last change time */
    unsigned long q_cbytes;         /* current number of bytes on queue */
    unsigned long q_qnum;           /* number of messages in queue */
    unsigned long q_qbytes;         /* max number of bytes on queue */
    pid_t q_lspid;                  /* pid of last msgsnd */
    pid_t q_lrpid;                  /* last receive pid */
    struct list_head q_messages;    /* message list */
    struct list_head q_receivers;   /* reader process list */
    struct list_head q_senders;     /* writer process list */
};
The q_messages field represents the head node of a doubly linked circular list
that contains all messages currently in the queue. Each message begins with a
header followed by message data; each message can consume one or more pages
depending on the length of the message data. The message header is always at the
start of the first page and is represented by an instance of struct msg_msg:
/* one msg_msg structure for each message */
struct msg_msg {
    struct list_head m_list;
    long m_type;
    size_t m_ts;                /* message text size */
    struct msg_msgseg *next;
    void *security;
    /* the actual message follows immediately */
};
The m_list field contains pointers to the previous and next messages in the queue.
The *next pointer refers to an instance of type struct msg_msgseg, which contains
the address of the next page of message data. This pointer is relevant only when
message data exceeds the first page. The second page frame starts with a
msg_msgseg descriptor, which in turn contains a pointer to a subsequent page, and
this order continues until the last page of the message data is reached:
struct msg_msgseg {
    struct msg_msgseg *next;
    /* the next part of the message follows immediately */
};
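The page chaining described above can be modeled with a little arithmetic. The sketch below is a user-space model under stated assumptions, not kernel code: the page size and the two descriptor sizes are assumed values, and msg_pages_needed() is a hypothetical helper.

```c
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u   /* assumption: 4 KiB pages */
#define MODEL_HDR_SIZE  48u     /* assumption: size of the message header */
#define MODEL_SEG_SIZE  8u      /* assumption: size of one segment descriptor */

/* Number of pages needed to hold a message with len bytes of payload. */
size_t msg_pages_needed(size_t len)
{
    size_t first = MODEL_PAGE_SIZE - MODEL_HDR_SIZE;  /* payload room in page 1 */
    size_t rest  = MODEL_PAGE_SIZE - MODEL_SEG_SIZE;  /* payload room in later pages */

    if (len <= first)
        return 1;                                     /* no segment needed */
    len -= first;
    return 1 + (len + rest - 1) / rest;               /* round the remainder up */
}
```

With these assumed sizes, a payload that fits beside the header needs a single page, and every further page loses a few bytes to its segment descriptor.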
struct mqueue_inode_info {
    spinlock_t lock;
    struct inode vfs_inode;
    wait_queue_head_t wait_q;

    struct rb_root msg_tree;
    struct posix_msg_tree_node *node_cache;
    struct mq_attr attr;

    struct sigevent notify;
    struct pid *notify_owner;
    struct user_namespace *notify_user_ns;
    struct user_struct *user;   /* user who created, for accounting */
    struct sock *notify_sock;
    struct sk_buff *notify_cookie;

    /* for tasks waiting for free space and messages, respectively */
    struct ext_wait_queue e_wait_q[2];

    unsigned long qsize;        /* size of queue in memory (sum of all msgs) */
};
The *node_cache pointer refers to the posix_msg_tree_node descriptor that
contains the header to a linked list of message nodes, in which each message is
represented by a descriptor of type msg_msg:
struct posix_msg_tree_node {
    struct rb_node rb_node;
    struct list_head msg_list;
    int priority;
};
Shared memory
Unlike message queues, which offer a process-persistent messaging
infrastructure, the shared memory service of IPC provides kernel-persistent
memory that can be attached by an arbitrary number of processes that share
common data. A shared memory infrastructure provides operation interfaces to
allocate, attach, detach, and destroy shared memory regions. A process that
needs access to shared data will attach or map a shared memory region into its
address space; it can then access data in shared memory through the address
returned by the mapping routine. This makes shared memory one of the fastest
means of IPC since, from a process's perspective, it is akin to accessing local
memory, which does not involve a switch into kernel mode.
System V shared memory
Linux supports the legacy SysV shared memory implementation under the IPC
subsystem. Similar to SysV message queues, each shared memory region is
identified by a unique IPC identifier.
Operation interfaces
The kernel provides distinct system call interfaces for initiating shared memory
operations as follows:
int shmget(key_t key, size_t size, int shmflg);
This function returns the identifier of the shared memory segment
corresponding to the value contained in the key parameter. If other
processes intend to use an existing segment, they can use the
segment's key value when looking for its identifier. A new segment is,
however, created if the key parameter has the value IPC_PRIVATE, or if
no segment exists for the key and IPC_CREAT is specified in shmflg.
size indicates the number of bytes to be allocated. As segments are
allocated in whole memory pages, the number of pages to be allocated
is obtained by rounding the size value up to the nearest multiple of
the page size.
The shmflg flag specifies how the segment needs to be created. It can
contain two values:
    IPC_CREAT: This indicates creating a new segment. If this flag is
    unused, the segment associated with the key value is looked up, and
    if the user has the access permissions, the segment's identifier is
    returned.
    IPC_EXCL: This flag is always used with IPC_CREAT, to ensure that
    the call fails if a segment for the key value already exists.
void *shmat(int shmid, const void *shmaddr, int shmflg);
The segment indicated by shmid is attached by this function. shmaddr
specifies a pointer indicating the location in the process's address
space where the segment is to be mapped. The third argument, shmflg,
is a bitmask that can include the following flags:
    SHM_RND: This is specified when shmaddr isn't a NULL value,
    instructing the function to attach the segment at the address
    computed by rounding the shmaddr value down to the nearest
    multiple of SHMLBA (which equals the page size on many
    architectures); otherwise, the user must take care that
    shmaddr is page-aligned so that the segment gets attached
    correctly.
    SHM_RDONLY: This specifies that the segment is attached for
    reading only, provided the user has the necessary read permissions.
    Otherwise, both read and write access to the segment is given (the
    process must have the respective permissions).
    SHM_REMAP: This is a Linux-specific flag that indicates that any
    existing mapping at the address specified by shmaddr be replaced
    with the new mapping.
Detaching shared memory
Likewise, to detach the shared memory from the process address space, shmdt() is
invoked. As IPC shared memory regions are persistent in the kernel, they
continue to exist even after the processes detach:
int shmdt(const void *shmaddr);
The segment at the address specified by shmaddr is detached from the address
space of the calling process.
Each of these interface operations invokes the relevant system call implemented
in the <ipc/shm.c> source file.
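The allocate, attach, detach, and destroy cycle can be sketched from user space as follows. This is a minimal illustration under assumptions: shm_roundtrip() is a hypothetical helper, the segment size of 4096 bytes is arbitrary, and IPC_PRIVATE is used so that the parent and child share the segment through fork() rather than through a key.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Create a private segment, let a child write into it, and read the
 * result back in the parent. Returns 0 if the parent sees the child's data.
 */
int shm_roundtrip(void)
{
    int ok = -1;
    int id = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (id == -1)
        return -1;

    char *p = shmat(id, NULL, 0);   /* NULL: let the kernel pick the address */
    if (p == (void *)-1) {
        shmctl(id, IPC_RMID, NULL);
        return -1;
    }

    pid_t pid = fork();
    if (pid == 0) {
        /* The attachment is inherited across fork(). */
        strcpy(p, "hello from child");
        shmdt(p);
        _exit(0);
    }
    if (pid > 0) {
        waitpid(pid, NULL, 0);
        ok = strcmp(p, "hello from child") == 0 ? 0 : -1;
    }

    shmdt(p);                       /* detach from this address space */
    shmctl(id, IPC_RMID, NULL);     /* destroy: the segment would otherwise persist */
    return ok;
}
```

Note the explicit shmctl(IPC_RMID) call: because the region is kernel persistent, detaching alone would leave the segment allocated.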
Data structures
Each shared memory segment is represented by a struct shmid_kernel descriptor.
This structure contains all metadata relevant to the management of SysV shared
memory:
struct shmid_kernel /* private to the kernel */
{
    struct kern_ipc_perm shm_perm;
    struct file *shm_file;          /* pointer to shared memory file */
    unsigned long shm_nattch;       /* number of attached processes */
    unsigned long shm_segsz;        /* size of the segment in bytes */
    time_t shm_atim;                /* last access time */
    time_t shm_dtim;                /* last detach time */
    time_t shm_ctim;                /* last change time */
    pid_t shm_cprid;                /* pid of creating process */
    pid_t shm_lprid;                /* pid of last access */
    struct user_struct *mlock_user;
    /* The task created the shm object. NULL if the task is dead. */
    struct task_struct *shm_creator;
    struct list_head shm_clist;     /* list by creator */
};
For reliability and ease of management, the kernel's IPC subsystem manages
shared memory segments through a special filesystem called shmfs. This
filesystem is not mounted onto the rootfs tree; its operations are only accessible
through SysV shared memory system calls. The *shm_file pointer refers to the
struct file object of shmfs that represents a shared memory block. When a process
initiates an attach operation, the underlying system call invokes do_mmap() to
create the relevant mapping in the caller's address space (through struct
vm_area_struct) and steps into the shmfs-defined shm_mmap() operation to map the
corresponding shared memory.
POSIX shared memory
The Linux kernel supports POSIX shared memory through a special filesystem
called tmpfs, which is mounted onto /dev/shm of the rootfs. This implementation
offers a distinct API that is consistent with the Unix file model, resulting in
each shared memory allocation being represented by a unique filename and
inode. This interface is considered more flexible by application programmers
since it allows the standard POSIX file-mapping routines mmap() and munmap() to be
used for attaching and detaching memory segments in the caller's address space.
Following is a summarized description of the interface routines:
API            Description
shm_open()     Create and open a shared memory segment identified by a filename
mmap()         POSIX standard file-mapping interface for attaching shared memory to the caller's address space
shm_unlink()   Destroy the specified shared memory block
munmap()       Detach the specified shared memory map from the caller's address space
The underlying implementation is similar to that of SysV shared memory, with
the difference that the mapping implementation is handled by the tmpfs
filesystem.
Although shared memory is the easiest way of sharing common data or
resources, it dumps the burden of implementing synchronization on the
processes, as a shared memory infrastructure does not provide any
synchronization or protection mechanism for the data or resources in the shared
memory region. An application designer must consider synchronization of
shared memory access between contending processes to ensure reliability and
validity of shared data, for instance, preventing a possible write by two processes
on the same region at a time, making a reading process wait until a write is
completed by another process, and so on. Often, to prevent such race
conditions, another IPC resource called semaphores is used.
Semaphores
Semaphores are synchronization primitives provided by the IPC subsystem.
They deliver a protective mechanism for shared data structures or resources
against concurrent access by processes in a multithreaded environment. At its
core, each semaphore is composed of an integer counter that can be atomically
accessed by a caller process. Semaphore implementations provide two
operations, one for waiting on a semaphore variable and another to signal the
semaphore variable. In other words, waiting on the semaphore decreases the
counter by 1 and signaling the semaphore increases the counter by 1. Typically,
when a process wants to access a shared resource, it tries to decrease the
semaphore counter. This attempt is, however, handled by the kernel, which blocks
the attempting process until the counter yields a positive value. Similarly, when a
process relinquishes the resource, it increases the semaphore counter, which
wakes up any process that is waiting for the resource.
Semaphore versions
Traditionally, all *nix systems implement the System V semaphore mechanism;
however, POSIX has its own implementation of semaphores, aiming at
portability and leveling a few clumsy issues that the System V version carries.
Let's begin by looking at System V semaphores.
System V semaphores
Semaphores in System V are not just a single counter as you might think, but
rather a set of counters. This implies that a semaphore set can contain single or
multiple counters (0 to n) with an identical semaphore ID. Each counter in the
set can protect a shared resource, and a single semaphore set can protect multiple
resources. The system call that helps create this kind of semaphore is as follows:
int semget(key_t key, int nsems, int semflg);
key is used to identify the semaphore. If the key value is IPC_PRIVATE, a new
set of semaphores is created.
nsems indicates the number of counters needed in the semaphore set.
semflg dictates how the semaphore should be created. It can contain two
values:
    IPC_CREAT: If the key does not exist, it creates a new semaphore
    IPC_EXCL: Used together with IPC_CREAT; if the key exists, the call fails with an error
On success, the call returns the semaphore set identifier (a non-negative value).
A semaphore thus created contains uninitialized values and requires the
initialization to be carried out using the semctl() function. After initialization, the
semaphore set can be used by the processes:
int semop(int semid, struct sembuf *sops, size_t nsops);
The semop() function lets the process initiate operations on the semaphore set.
This function offers a facility unique to the SysV semaphore implementation
called undoable operations, through a special flag called SEM_UNDO. When this flag
is set, the kernel allows a semaphore to be restored to a consistent state if a
process aborts before completing the relevant shared data access operation. For
instance, consider a case where one of the processes locks the semaphore and
begins its access operations on shared data; if the process aborts during this
time, before completing the shared data access, the semaphore will be left in an
inconsistent state, making it unavailable for other contending processes.
However, if the process had acquired the lock on the semaphore by setting the
SEM_UNDO flag with semop(), its termination would allow the kernel to revert the
semaphore to a consistent state (unlocked state), making it available for other
contending processes in wait.
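The undo behaviour can be observed with a short user-space sketch. Here sem_undo_demo() is a hypothetical helper: a child locks a one-counter set with SEM_UNDO and exits without unlocking, and the parent then reads the counter back. On Linux the caller is expected to define union semun itself, as done below.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>
#include <sys/wait.h>
#include <unistd.h>

/* On Linux the caller must define union semun itself (see semctl(2)). */
union semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/*
 * Returns the counter value seen by the parent after the child exited
 * while holding the semaphore, or -1 on error.
 */
int sem_undo_demo(void)
{
    int val = -1;
    pid_t pid;
    union semun arg = { .val = 1 };          /* 1 = unlocked */

    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (id == -1)
        return -1;
    if (semctl(id, 0, SETVAL, arg) == -1)    /* initialize the counter */
        goto cleanup;

    pid = fork();
    if (pid == 0) {
        struct sembuf lock = { .sem_num = 0, .sem_op = -1, .sem_flg = SEM_UNDO };
        semop(id, &lock, 1);                 /* counter: 1 -> 0 */
        _exit(0);                            /* exit while holding the lock */
    }
    if (pid > 0) {
        waitpid(pid, NULL, 0);
        val = semctl(id, 0, GETVAL);         /* kernel has reverted the decrement */
    }
cleanup:
    semctl(id, 0, IPC_RMID);
    return val;
}
```

Without SEM_UNDO in sem_flg, the same sequence would leave the counter at 0 and any later waiter would block.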
Data structures
Each SysV semaphore set is represented in the kernel by a descriptor of type
struct sem_array:
/* One sem_array data structure for each set of semaphores in the system. */
struct sem_array {
    struct kern_ipc_perm ____cacheline_aligned_in_smp sem_perm;
    time_t sem_ctime;                  /* last change time */
    struct sem *sem_base;              /* ptr to first semaphore in array */
    struct list_head pending_alter;    /* pending operations */
                                       /* that alter the array */
    struct list_head pending_const;    /* pending complex operations */
                                       /* that do not alter semvals */
    struct list_head list_id;          /* undo requests on this array */
    int sem_nsems;                     /* no. of semaphores in array */
    int complex_count;                 /* pending complex operations */
    bool complex_mode;                 /* no parallel simple ops */
};
Each semaphore in the array is represented by an instance of struct sem, defined
in <ipc/sem.c>; the *sem_base pointer refers to the first semaphore object in the
set. Each semaphore set contains a queue of pending operations, with one entry
per waiting process; pending_alter is the head node for this pending queue of type
struct sem_queue. Each semaphore set also tracks undoable operations. list_id is
the head node of a list of struct sem_undo instances; there is one instance in the
list for each process that has requested undo operations on the set, and each
instance records an adjustment value per semaphore. The following diagram sums up
the semaphore set data structure and its lists:
POSIX semaphores
POSIX semaphore semantics are rather simple when compared to System V.
Each semaphore is a simple counter that can never be less than zero. The
implementation provides function interfaces for initialization, increment, and
decrement operations. They can be used for synchronizing threads by allocating
the semaphore instance in memory accessible to all the threads. They can also be
used for synchronizing processes by placing the semaphore in shared memory.
The Linux implementation of POSIX semaphores is optimized to deliver better
performance for non-contending synchronization scenarios.
POSIX semaphores are available in two variants: named semaphores and
unnamed semaphores. A named semaphore is identified by a filename and is
suitable for use between unrelated processes. An unnamed semaphore is just a
global instance of type sem_t; this form is generally preferred for use between
threads. POSIX semaphore interface operations are part of the POSIX threads
library implementation.
Function interfaces   Description
sem_open()            Opens an existing named semaphore file or creates a new named semaphore and returns its descriptor
sem_init()            Initializer routine for an unnamed semaphore
sem_post()            Operation to increment semaphore
sem_wait()            Operation to decrement semaphore, blocks if invoked when semaphore value is zero
sem_timedwait()       Extends sem_wait() with a timeout parameter for bounded wait
sem_getvalue()        Returns the current value of the semaphore counter
sem_unlink()          Removes a named semaphore identified with a file
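A few of these interfaces can be exercised on an unnamed semaphore as sketched below. posix_sem_demo() is a hypothetical helper; it also uses sem_trywait(), a standard POSIX call that is not listed in the table, so that the example never blocks. On older C libraries the program may need to be linked with -pthread.

```c
#include <errno.h>
#include <semaphore.h>

/*
 * Exercise an unnamed semaphore initialized to 1.
 * Returns 0 if every step behaves as described, -1 otherwise.
 */
int posix_sem_demo(void)
{
    sem_t sem;
    int val = -1;

    if (sem_init(&sem, 0, 1) == -1)    /* pshared = 0: shared between threads */
        return -1;

    if (sem_wait(&sem) == -1)          /* 1 -> 0, does not block */
        return -1;
    sem_getvalue(&sem, &val);
    if (val != 0)
        return -1;

    /* A second decrement would block; sem_trywait() reports EAGAIN instead. */
    if (sem_trywait(&sem) != -1 || errno != EAGAIN)
        return -1;

    sem_post(&sem);                    /* 0 -> 1 */
    sem_getvalue(&sem, &val);
    sem_destroy(&sem);
    return val == 1 ? 0 : -1;
}
```

For use between processes, the same sem_t would be placed in shared memory and initialized with a non-zero pshared argument.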
Summary
In this chapter, we touched on various IPC mechanisms offered by the kernel.
We explored the layout and relationship between various data structures for each
mechanism, and also looked at both SysV and POSIX IPC mechanisms.
In the next chapter, we will take this discussion further into locking and
kernel-synchronization mechanisms.
Virtual Memory Management
In the first chapter, we had a brief discussion about an important abstraction
called a process. We discussed the process virtual address space and its isolation,
and also traversed the memory management subsystem and gained a thorough
understanding of the various data structures and algorithms that go into physical
memory management. In this chapter, let's extend our discussion on memory
management with details of virtual memory management and page tables. We will
look into the following aspects of the virtual memory subsystem:
Process virtual address space and its segments
Memory descriptor structure
Memory mapping and VMA objects
File-backed memory mappings
Page cache
Address translation with page tables
Process address space
The following diagram depicts the layout of a typical process address space in
Linux systems, which is composed of a set of virtual memory segments:
Each segment is physically mapped to one or more linear memory blocks (made
out of one or more pages), and appropriate address translation records are placed
in a process page table. Before we get into the complete details of how the kernel
manages memory maps and constructs page tables, let's understand in brief each
segment of the address space:
Stack is the topmost segment, which expands downward. It contains stack
frames that hold local variables and function parameters; a new frame is
created on top of the stack upon entry into a called function, and is
destroyed when the current function returns. Depending on the level of
nesting of the function calls, there is always a need for the stack segment to
dynamically expand to accommodate new frames. Such expansion is
handled by the virtual memory manager through page faults: when the
process attempts to touch an unmapped address just beyond the current top
of the stack, the system triggers a page fault, which is handled by the kernel
to check whether it is appropriate to grow the stack. If the current stack
utilization is within RLIMIT_STACK, then it is considered appropriate and the
stack is expanded. However, if the current utilization is at the maximum with
no further scope to expand, then a segmentation fault signal is delivered to
the process.
Mmap is a segment below the stack; this segment is primarily used for
mapping file data from the page cache into the process address space. This
segment is also used for mapping shared objects or dynamic libraries. User-mode
processes can initiate new mappings through the mmap() API. The Linux
kernel also supports anonymous memory mapping through this segment,
which serves as an alternative mechanism for dynamic memory allocations
to store process data.
Heap segment provides address space for dynamic memory allocation that
allows a process to store runtime data. The kernel provides the brk() family
of APIs, through which user-mode processes can expand or shrink the heap
at runtime. However, most programming-language-specific standard
libraries implement heap management algorithms for efficient utilization of
heap memory. For instance, GNU glibc implements heap management that
offers the malloc() family of functions for allocations.
The lower segments of the address space (BSS, Data, and Text) are related to
the binary image of the process:
The BSS stores uninitialized static variables, whose values are not
initialized in the program code. The BSS is set up through anonymous
memory mapping.
The data segment contains global and static variables initialized in program
source code. This segment is set up by mapping the part of the program
binary image that contains initialized data; this mapping is created as a
private memory mapping, which ensures that changes to data variables'
memory are not reflected on the disk file.
The text segment is also set up by mapping the program binary file
into memory; this mapping is read-only, resulting in a segmentation
fault being triggered on an attempt to write into this segment.
The kernel supports the address space randomization facility, which if enabled
during build allows the VM subsystem to randomize start locations for stack,
mmap, and heap segments for each new process. This provides processes with
much-needed security from malicious programs that are capable of injecting
faults. Hacker programs are generally hard-coded with fixed start addresses of
memory segments of a valid process; with address space randomization, such
malicious attacks would fail. However, text segments enumerated from the
binary file of the application program are mapped to a fixed address as per the
definition of the underlying architecture; this is configured into the linker script,
which is applied while constructing the program binary file.
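The relative placement of the segments can be observed from user space by sampling one address from each of them. The sketch below is illustrative only: struct layout and sample_layout() are hypothetical names, and the assumption that a small malloc() request is served from the heap segment holds for typical glibc configurations but is not guaranteed.

```c
#include <stdint.h>
#include <stdlib.h>

int initialized_global = 42;   /* lives in the data segment */
int uninitialized_global;      /* lives in the BSS */

struct layout {
    uintptr_t text, data, bss, heap, stack;
};

/* Sample one address from each segment of the calling process. */
struct layout sample_layout(void)
{
    struct layout l;
    int local = 0;                 /* lives in the current stack frame */
    void *h = malloc(16);          /* small requests typically come from the heap */

    l.text  = (uintptr_t)&sample_layout;
    l.data  = (uintptr_t)&initialized_global;
    l.bss   = (uintptr_t)&uninitialized_global;
    l.heap  = (uintptr_t)h;
    l.stack = (uintptr_t)&local;
    free(h);
    return l;
}
```

Printing the fields with printf("%lx") across several runs shows the stack and heap addresses changing when randomization is enabled, while their ordering stays the same.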
Process memory descriptor
The kernel maintains all information on process memory segments and the
corresponding translation table in a memory descriptor structure, which is of
type struct mm_struct. The process descriptor structure task_struct contains a
pointer *mm to the memory descriptor for the process. We shall discuss a few
important elements of the memory descriptor structure:
struct mm_struct {
    struct vm_area_struct *mmap;    /* list of VMAs */
    struct rb_root mm_rb;
    u32 vmacache_seqnum;            /* per-thread vmacache */
#ifdef CONFIG_MMU
    unsigned long (*get_unmapped_area)(struct file *filp, unsigned long addr,
                                       unsigned long len, unsigned long pgoff,
                                       unsigned long flags);
#endif
    unsigned long mmap_base;        /* base of mmap area */
    unsigned long mmap_legacy_base; /* base of mmap area in bottom-up allocations */
    unsigned long task_size;        /* size of task vm space */
    unsigned long highest_vm_end;   /* highest vma end address */
    pgd_t *pgd;
    atomic_t mm_users;              /* How many users with user space? */
    atomic_t mm_count;              /* How many references to "struct mm_struct" (users count as 1) */
    atomic_long_t nr_ptes;          /* PTE page table pages */
#if CONFIG_PGTABLE_LEVELS > 2
    atomic_long_t nr_pmds;          /* PMD page table pages */
#endif
    int map_count;                  /* number of VMAs */
    spinlock_t page_table_lock;     /* Protects page tables and some counters */
    struct rw_semaphore mmap_sem;
    struct list_head mmlist;        /* List of maybe swapped mm's. These are globally strung
                                     * together off init_mm.mmlist, and are protected
                                     * by mmlist_lock
                                     */
    unsigned long hiwater_rss;      /* High-watermark of RSS usage */
    unsigned long hiwater_vm;       /* High-water virtual memory usage */
    unsigned long total_vm;         /* Total pages mapped */
    unsigned long locked_vm;        /* Pages that have PG_mlocked set */
    unsigned long pinned_vm;        /* Refcount permanently increased */
    unsigned long data_vm;          /* VM_WRITE & ~VM_SHARED & ~VM_STACK */
    unsigned long exec_vm;          /* VM_EXEC & ~VM_WRITE & ~VM_STACK */
    unsigned long stack_vm;         /* VM_STACK */
    unsigned long def_flags;
    unsigned long start_code, end_code, start_data, end_data;
    unsigned long start_brk, brk, start_stack;
    unsigned long arg_start, arg_end, env_start, env_end;
    unsigned long saved_auxv[AT_VECTOR_SIZE];   /* for /proc/PID/auxv */
    /*
     * Special counters, in some configurations protected by the
     * page_table_lock, in other configurations by being atomic.
     */
    struct mm_rss_stat rss_stat;
    struct linux_binfmt *binfmt;
    cpumask_var_t cpu_vm_mask_var;
    /* Architecture-specific MM context */
    mm_context_t context;
    unsigned long flags;            /* Must use atomic bitops to access the bits */
    struct core_state *core_state;  /* core dumping support */
    ...
    ...
    ...
};
mmap_base refers to the start of the mmap segment in the virtual address space, and
task_size contains the total size of the task in the virtual memory space. mm_users
is an atomic counter that holds the count of LWPs that share this memory
descriptor. mm_count holds the number of references to the descriptor (all users
counted by mm_users together count as one reference), and the VM subsystem
ensures that a memory descriptor structure is only released when mm_count is
zero. The start_code and end_code fields contain the start and end virtual
addresses for the code block mapped from the program's binary file. Similarly,
start_data and end_data mark the beginning and end of the initialized data region
mapped from the program's binary file.
The start_brk and brk fields represent the start and current end addresses of the
heap segment; while start_brk remains constant throughout the process lifetime,
brk is repositioned while allocating and releasing heap memory. Therefore, the
total size of the active heap at a given moment in time is the size of the memory
between the start_brk and brk fields. The elements arg_start and arg_end contain
the locations of the command-line argument list, and env_start and env_end contain
the start and end locations for environment variables.
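The repositioning of brk can be observed from user space through sbrk(), which returns the current program break. The sketch below is a minimal illustration; heap_growth() is a hypothetical helper, and it assumes that nothing else in the process moves the break between the calls.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE        /* exposes sbrk() in glibc */
#endif
#include <stdint.h>
#include <unistd.h>

/*
 * Grow the heap by delta bytes, then shrink it back.
 * Returns how far the program break moved, or -1 on failure.
 */
long heap_growth(long delta)
{
    void *before = sbrk(0);        /* current break, the user-space view of brk */
    void *after;

    if (sbrk(delta) == (void *)-1)
        return -1;
    after = sbrk(0);
    sbrk(-delta);                  /* give the memory back */
    return (long)((uintptr_t)after - (uintptr_t)before);
}
```

Each successful call corresponds to the kernel moving the end of the heap segment, while the start of the heap stays where it was placed at program load.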
Each linear memory region mapped to a segment in virtual address space is
represented through a descriptor of type struct vm_area_struct. Each VM area
region is mapped with a virtual address interval that contains start and end
virtual addresses along with other attributes. The VM subsystem maintains a
linked list of vm_area_struct (VMA) nodes representing current regions; this list is
sorted in ascending order, with the first node representing the start virtual
address interval and the node that follows containing the next address interval,
and so on. The memory descriptor structure includes a pointer *mmap, which refers
to this list of VM areas currently mapped.
The VM subsystem will need to scan the vm_area list while performing various
operations on VM regions, such as looking for a specific address within mapped
address intervals, or appending a new VMA instance representing a new
mapping. Such operations could be time consuming and inefficient, especially for
cases where a large number of regions are mapped into the list. As a
workaround, the VM subsystem maintains a red-black tree for efficient access of
vm_area objects. The memory descriptor structure includes the root node of the
red-black tree, mm_rb. With this arrangement, new VM regions can be quickly
appended by searching the red-black tree for the region preceding the address
interval for the new region; this eliminates the need to explicitly scan the linked
list.
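The ascending order of the regions is visible from user space in /proc/self/maps, where each line describes one mapped region as a start-end address pair. The following sketch (count_sorted_regions() is a hypothetical helper) walks that file and checks that the intervals are listed in ascending, non-overlapping order.

```c
#include <stdio.h>

/*
 * Walk /proc/self/maps and verify that regions are listed in ascending,
 * non-overlapping order. Returns the number of regions, or -1 on a
 * parse error or an ordering violation.
 */
int count_sorted_regions(void)
{
    FILE *f = fopen("/proc/self/maps", "r");
    unsigned long start, end, prev_end = 0;
    int n = 0, r;

    if (!f)
        return -1;
    /* Read "start-end" and discard the rest of each line. */
    while ((r = fscanf(f, "%lx-%lx%*[^\n]", &start, &end)) == 2) {
        if (start >= end || start < prev_end) {
            n = -1;
            break;
        }
        prev_end = end;
        n++;
    }
    if (r != 2 && r != EOF)
        n = -1;                    /* malformed line */
    fclose(f);
    return n;
}
```

The end address printed for each region is exclusive, so adjacent regions may share a boundary value without overlapping.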
struct vm_area_struct is defined in the kernel header <linux/mm_types.h>:
/*
 * This struct defines a memory VMM memory area. There is one of these
 * per VM-area/task. A VM area is any part of the process virtual memory
 * space that has a special rule for the page-fault handlers (ie a shared
 * library, the executable area etc).
 */
struct vm_area_struct {
    /* The first cache line has the info for VMA tree walking. */
    unsigned long vm_start;     /* Our start address within vm_mm. */
    unsigned long vm_end;       /* The first byte after our end address within vm_mm. */
    /* linked list of VM areas per task, sorted by address */
    struct vm_area_struct *vm_next, *vm_prev;
    struct rb_node vm_rb;
    /*
     * Largest free memory gap in bytes to the left of this VMA.
     * Either between this VMA and vma->vm_prev, or between one of the
     * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
     * get_unmapped_area find a free area of the right size.
     */
    unsigned long rb_subtree_gap;
    /* Second cache line starts here. */
    struct mm_struct *vm_mm;    /* The address space we belong to. */
    pgprot_t vm_page_prot;      /* Access permissions of this VMA. */
    unsigned long vm_flags;     /* Flags, see mm.h. */
    /*
     * For areas with an address space and backing store,
     * linkage into the address_space->i_mmap interval tree.
     */
    struct {
        struct rb_node rb;
        unsigned long rb_subtree_last;
    } shared;
    /*
     * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
     * list, after a COW of one of the file pages. A MAP_SHARED vma
     * can only be in the i_mmap tree. An anonymous MAP_PRIVATE, stack
     * or brk vma (with NULL file) can only be in an anon_vma list.
     */
    struct list_head anon_vma_chain;    /* Serialized by mmap_sem & page_table_lock */
    struct anon_vma *anon_vma;          /* Serialized by page_table_lock */
    /* Function pointers to deal with this struct. */
    const struct vm_operations_struct *vm_ops;
    /* Information about our backing store: */
    unsigned long vm_pgoff;     /* Offset (within vm_file) in PAGE_SIZE units */
    struct file *vm_file;       /* File we map to (can be NULL). */
    void *vm_private_data;      /* was vm_pte (shared mem) */
#ifndef CONFIG_MMU
    struct vm_region *vm_region;    /* NOMMU mapping region */
#endif
#ifdef CONFIG_NUMA
    struct mempolicy *vm_policy;    /* NUMA policy for the VMA */
#endif
    struct vm_userfaultfd_ctx vm_userfaultfd_ctx;
};
vm_start contains the start virtual address (lower address) of the region, which is
the address of the first valid byte of the mapping, and vm_end contains the virtual
address of the first byte beyond the mapped region (higher address). Thus, the
length of the mapped memory region can be computed by subtracting vm_start
from vm_end. The pointers *vm_next and *vm_prev refer to the next and previous
VMAs in the list, while the vm_rb element is for representing this VMA under the
red-black tree. The *vm_mm pointer refers back to the process memory descriptor
structure. vm_page_prot contains access permissions for the pages in the region.
vm_flags is a bit field that contains properties for memory in the mapped region.
Flag bits are defined in the kernel header <linux/mm.h>.
Flag bits         Description
VM_NONE           Indicates inactive mapping.
VM_READ           If set, pages in the mapped area are readable.
VM_WRITE          If set, pages in the mapped area are writable.
VM_EXEC           This is set to mark a memory region as executable. Memory blocks containing executable instructions are set with this flag along with VM_READ.
VM_SHARED         If set, pages in the mapped region are shared.
VM_MAYREAD        Flag to indicate that VM_READ can be set on a currently mapped region. This flag is for use with the mprotect() system call.
VM_MAYWRITE       Flag to indicate that VM_WRITE can be set on a currently mapped region. This flag is for use with the mprotect() system call.
VM_MAYEXEC        Flag to indicate that VM_EXEC can be set on a currently mapped region. This flag is for use with the mprotect() system call.
VM_GROWSDOWN      Mapping can grow downward; the stack segment is assigned this flag.
VM_UFFD_MISSING   This flag is set to indicate to the VM subsystem that userfaultfd is enabled for this mapping, and is set to track page missing faults.
VM_PFNMAP         This flag is set to indicate that the memory region is mapped through PFN-tracked pages, unlike regular page frames with page descriptors.
VM_DENYWRITE      Set to indicate that the current file mapping is not writable.
VM_UFFD_WP        This flag is set to indicate to the VM subsystem that userfaultfd is enabled for this mapping, and is set to track write-protect faults.
VM_LOCKED         Set when corresponding pages in the mapped memory region are locked.
VM_IO             Set when the device I/O area is mapped.
VM_SEQ_READ       Set when a process declares its intention to access the memory area within the mapped region sequentially.
VM_RAND_READ      Set when a process declares its intention to access the memory area within the mapped region at random.
VM_DONTCOPY       Set to indicate to the VM to disable copying this VMA on fork().
VM_DONTEXPAND     Set to indicate that the current mapping cannot expand on mremap().
VM_LOCKONFAULT    Lock pages in the memory map when they are faulted in. This flag is set when a process enables MLOCK_ONFAULT with the mlock2() system call.
VM_ACCOUNT        The VM subsystem performs additional checks to ensure there is memory available when performing operations on VMAs with this flag.
VM_NORESERVE      Set to indicate that the VM should suppress accounting.
VM_HUGETLB        Indicates that the current mapping contains hugeTLB pages.
VM_DONTDUMP       If set, the current VMA is not included in the core dump.
VM_MIXEDMAP       Set when the VMA mapping contains both traditional page frames (managed through the page descriptor) and PFN-managed pages.
VM_HUGEPAGE       Set when the VMA is marked with MADV_HUGEPAGE to instruct the VM that pages under this mapping must be of type Transparent Huge Pages (THP). This flag works only with private anonymous mappings.
VM_NOHUGEPAGE     Set when the VMA is marked with MADV_NOHUGEPAGE.
VM_MERGEABLE      Set when the VMA is marked with MADV_MERGEABLE, which enables the kernel same-page merging (KSM) facility.
VM_ARCH_1         Architecture-specific extensions.
VM_ARCH_2         Architecture-specific extensions.
The following figure depicts the typical layout of a vm_area list as pointed to by
the memory descriptor structure of the process:
As depicted here, some memory regions mapped into the address space are
file-backed (code regions from the application binary file, shared library, shared
memory mappings, and so on). File buffers are managed by the kernel's page
cache framework, which implements its own data structures to represent and
manage file caches. The page cache tracks mappings to file regions by various
user-mode processes through an address_space data structure. The shared element
of the vm_area_struct object links this VMA into a red-black tree associated
with the address space. We'll discuss more about the page cache and address_space
objects in the next section.
Regions of the virtual address space such as heap, stack, and mmap are allocated
through anonymous memory mappings. The VM subsystem groups all VMA
instances of the process that represent anonymous memory regions into a list and
represents them through a descriptor of type struct anon_vma. This structure
enables quick access to all of the process VMAs that map anonymous pages; the
*anon_vma pointer of each anonymous VMA structure refers to the anon_vma object.
However, when a process forks a child, all anonymous pages of the caller
address space are shared with the child process under copy-on-write (COW).
This causes new VMAs to be created (for the child) that represent the same
anonymous memory regions of the parent. The memory manager would need to
locate and track all VMAs that refer to the same regions for it to be able to
support unmap and swap-out operations. As a solution, the VM subsystem uses
another descriptor called struct anon_vma_chain that links all anon_vma structures
of a process group. The anon_vma_chain element of the VMA structure is a list
element of the anonymous VMA chain.
Each VMA instance is bound to a descriptor of type vm_operations_struct, which
contains operations performed on the current VMA. The *vm_ops pointer of the
VMA instance refers to the operations object:
/*
 * These are the virtual MM functions - opening of an area, closing and
 * unmapping it (needed to keep files on disk up-to-date etc), pointer
 * to the functions called when a no-page or a wp-page exception occurs.
 */
struct vm_operations_struct {
    void (*open)(struct vm_area_struct *area);
    void (*close)(struct vm_area_struct *area);
    int (*mremap)(struct vm_area_struct *area);
    int (*fault)(struct vm_area_struct *vma, struct vm_fault *vmf);
    int (*pmd_fault)(struct vm_area_struct *, unsigned long address,
                     pmd_t *, unsigned int flags);
    void (*map_pages)(struct fault_env *fe,
                      pgoff_t start_pgoff, pgoff_t end_pgoff);
    /* notification that a previously read-only page is about to become
     * writable, if an error is returned it will cause a SIGBUS */
    int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
    /* same as page_mkwrite when using VM_PFNMAP|VM_MIXEDMAP */
    int (*pfn_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);
    /* called by access_process_vm when get_user_pages() fails, typically
     * for use by special VMAs that can switch between memory and hardware
     */
    int (*access)(struct vm_area_struct *vma, unsigned long addr,
                  void *buf, int len, int write);
    /* Called by the /proc/PID/maps code to ask the vma whether it
     * has a special name. Returning non-NULL will also cause this
     * vma to be dumped unconditionally. */
    const char *(*name)(struct vm_area_struct *vma);
    ...
    ...
The routine assigned to the *open() function pointer is invoked when the VMA is
added to the address space. Similarly, the routine assigned to the *close()
function pointer is invoked when the VMA is detached from the virtual address
space. The function assigned to the *mremap() interface is executed when the
memory area mapped by the VMA is to be resized. When the physical region
mapped by the VMA is inactive, an access to it triggers a page fault exception,
and the function assigned to the *fault() pointer is invoked by the kernel's
page-fault handler to read the corresponding data of the VMA region into a
physical page.
The kernel supports direct access operations (DAX) for files on storage devices
that are similar to memory, such as NVRAM, flash storage, and other persistent
memory devices. Drivers for such storage devices are implemented to perform
all read and write operations directly on storage, without any caching. When a
user process attempts to map a file from a DAX storage device, the underlying
disk driver directly maps the corresponding file pages into the process's
virtual address space. For optimal performance, user-mode processes can map
large files from DAX storage by enabling VM_HUGETLB. Due to the large page
sizes supported, page faults on DAX file maps cannot be handled through regular
page fault handlers, and filesystems supporting DAX need to assign appropriate
fault handlers to the *pmd_fault() pointer of the VMA.
Managing virtual memory areas
The kernel's VM subsystem implements various operations to manipulate the
virtual memory regions of a process; these include functions to create, insert,
modify, locate, merge, and delete VMA instances. We will discuss a few of the
important routines.
Locating a VMA
The find_vma() routine locates the first region in the VMA list that satisfies
the condition for a given address (addr < vm_area_struct->vm_end).
/* Look up the first VMA which satisfies addr < vm_end, NULL if none. */
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
    struct rb_node *rb_node;
    struct vm_area_struct *vma;
    /* Check the cache first. */
    vma = vmacache_find(mm, addr);
    if (likely(vma))
        return vma;
    rb_node = mm->mm_rb.rb_node;
    while (rb_node) {
        struct vm_area_struct *tmp;
        tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
        if (tmp->vm_end > addr) {
            vma = tmp;
            if (tmp->vm_start <= addr)
                break;
            rb_node = rb_node->rb_left;
        } else
            rb_node = rb_node->rb_right;
    }
    if (vma)
        vmacache_update(addr, vma);
    return vma;
}
The function first checks whether the requested address falls within a recently
accessed VMA found in the per-thread vmacache. On a match, it returns the
address of that VMA; otherwise, it steps into the red-black tree to locate the
appropriate VMA. The root node of the tree is located in mm->mm_rb.rb_node.
Through the helper rb_entry(), the VMA that embeds each visited node is
obtained and checked against the address. If the VMA ends above the specified
address, it is recorded as a candidate; if its start address is also at or
below the specified address, the address lies inside that VMA and the search
stops. Otherwise, the search continues into the left child (looking for a
lower candidate) or, when the VMA ends at or below the address, into the right
child of the rbtree. When a suitable VMA is found, a pointer to it is stored in
the vmacache (anticipating that the next call to find_vma() will look up a
neighboring address in the same region), and the function returns the address
of the VMA instance. Note that the returned VMA is only guaranteed to end above
addr; it does not necessarily contain it.
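The lookup logic can be tried outside the kernel with a small user-space sketch. The following is not the kernel implementation: struct region, the one-entry cache, and find_region() are hypothetical stand-ins, and a sorted array with binary search replaces the red-black tree. It only illustrates the "first region with addr < end" rule and the cache-first check.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for struct vm_area_struct. */
struct region {
    unsigned long start;   /* first address in the region */
    unsigned long end;     /* first address past the region */
};

/* One-entry cache, standing in for the per-thread vmacache. */
static const struct region *cached;

/*
 * Return the first region with addr < end, or NULL if none.
 * regions[] must be sorted by address and must not overlap.
 */
static const struct region *find_region(const struct region *regions,
                                        size_t n, unsigned long addr)
{
    const struct region *found = NULL;
    size_t lo = 0, hi = n;

    assert(regions != NULL || n == 0);

    /* Fast path: the cached region contains addr. */
    if (cached && cached->start <= addr && addr < cached->end)
        return cached;

    /* Binary search mirrors the left/right descent of the tree. */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (regions[mid].end > addr) {
            found = &regions[mid];      /* candidate; look further left */
            if (regions[mid].start <= addr)
                break;                  /* addr lies inside this region */
            hi = mid;
        } else {
            lo = mid + 1;               /* region ends at or below addr */
        }
    }

    if (found)
        cached = found;                 /* remember for the next lookup */
    return found;
}
```

As in find_vma(), an address that falls in a gap between two regions yields the next higher region, so callers that need containment must still compare the address with the start of the returned region.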
When a new region is added immediately before or after an existing region (and
therefore also between two existing regions), the kernel merges the data
structures involved into a single structure, but only if the access
permissions for all the regions involved are identical and contiguous data is
mapped from the same backing store.
Merging VMA regions
When a new VMA is mapped immediately before or after an existing VMA with
identical access attributes and data from a file-backed memory region, it is
more efficient to merge them into a single VMA structure. vma_merge() is a
helper function that is invoked to merge surrounding VMAs with identical
attributes:
struct vm_area_struct *vma_merge(struct mm_struct *mm,
        struct vm_area_struct *prev, unsigned long addr,
        unsigned long end, unsigned long vm_flags,
        struct anon_vma *anon_vma, struct file *file,
        pgoff_t pgoff, struct mempolicy *policy,
        struct vm_userfaultfd_ctx vm_userfaultfd_ctx)
{
    pgoff_t pglen = (end - addr) >> PAGE_SHIFT;
    struct vm_area_struct *area, *next;
    int err;
    ...
    ...
*mm refers to the memory descriptor of the process whose VMAs are to be
merged; *prev refers to a VMA whose address interval precedes the new region;
and addr, end, and vm_flags contain the start, end, and flags of the new
region. *file refers to the file instance whose memory region is mapped to the
new region, and pgoff specifies the offset of the mapping within the file data.
This function first checks if the new region can be merged with the predecessor:
    ...
    ...
    /*
     * Can it merge with the predecessor?
     */
    if (prev && prev->vm_end == addr &&
            mpol_equal(vma_policy(prev), policy) &&
            can_vma_merge_after(prev, vm_flags,
                                anon_vma, file, pgoff,
                                vm_userfaultfd_ctx)) {
    ...
    ...
For this, it verifies that the end address of the predecessor corresponds to
the start address of the new region, and invokes the helper function
can_vma_merge_after(), which checks that the access flags are identical for
both regions, that the offsets of the file mappings are contiguous within the
file, and that the anonymous mappings (anon_vma) of both regions are
compatible:
        ...
        ...
        /*
         * OK, it can. Can we now merge in the successor as well?
         */
        if (next && end == next->vm_start &&
                mpol_equal(policy, vma_policy(next)) &&
                can_vma_merge_before(next, vm_flags,
                                     anon_vma, file,
                                     pgoff + pglen,
                                     vm_userfaultfd_ctx) &&
                is_mergeable_anon_vma(prev->anon_vma,
                                      next->anon_vma, NULL)) {
            /* cases 1, 6 */
            err = __vma_adjust(prev, prev->vm_start,
                               next->vm_end, prev->vm_pgoff, NULL,
                               prev);
        } else /* cases 2, 5, 7 */
            err = __vma_adjust(prev, prev->vm_start,
                               end, prev->vm_pgoff, NULL, prev);
        ...
        ...
}
It then checks whether merging is also possible with the successor region; for
this it invokes the helper function can_vma_merge_before(). This function
carries out checks similar to the earlier ones, and if both the predecessor and
the successor regions are found to be compatible, is_mergeable_anon_vma() is
invoked to check whether the anonymous mappings of the predecessor can be
merged with those of the successor. Finally, another helper function,
__vma_adjust(), is invoked to perform the final merging, which manipulates the
VMA instances appropriately.
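The adjacency and contiguity conditions described above can be expressed as a short user-space sketch. This is a simplified illustration, not the kernel code: struct region and both helpers are hypothetical, a 4 KB page size is assumed, and the memory policy, userfaultfd context, and anon_vma checks are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SHIFT 12   /* assumption: 4 KB pages */

/* Hypothetical, reduced view of a file-backed VMA. */
struct region {
    unsigned long start, end;  /* [start, end) in bytes */
    unsigned long flags;       /* access flags */
    const void *file;          /* identity of the backing file */
    unsigned long pgoff;       /* file offset of start, in pages */
};

/* Can a new region that starts at addr be appended to prev? */
static bool can_merge_after(const struct region *prev, unsigned long addr,
                            unsigned long flags, const void *file,
                            unsigned long pgoff)
{
    unsigned long prev_pages;

    assert(prev != NULL);
    if (prev->end != addr || prev->flags != flags || prev->file != file)
        return false;
    /* The new file offset must continue where prev leaves off. */
    prev_pages = (prev->end - prev->start) >> SKETCH_PAGE_SHIFT;
    return prev->pgoff + prev_pages == pgoff;
}

/*
 * Can a new region that ends at end be prepended to next?
 * end_pgoff is the file offset just past the new region
 * (pgoff + pglen in vma_merge()).
 */
static bool can_merge_before(const struct region *next, unsigned long end,
                             unsigned long flags, const void *file,
                             unsigned long end_pgoff)
{
    assert(next != NULL);
    return next->start == end && next->flags == flags &&
           next->file == file && next->pgoff == end_pgoff;
}
```

When both helpers return true for the same new region, the three ranges can collapse into one, which corresponds to the "cases 1, 6" branch of the listing.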
Similar helper functions exist for creating, inserting, and deleting memory
regions; they are invoked from do_mmap() and do_munmap(), which are called when
user-mode applications attempt to mmap() and munmap() memory regions,
respectively. We will not discuss details of these helper routines any further.
struct address_space
Memory caches are an integral part of modern memory management. In simple
words, a cache is a collection of pages used for specific needs. Most operating
systems implement a buffer cache, which is a framework that manages a list of
memory blocks for caching persistent storage disk blocks. The buffer cache
allows filesystems to minimize disk I/O operations by grouping and deferring
disk sync until an appropriate time.
The Linux kernel implements a page cache as a mechanism for caching; in
simple words, the page cache is a collection of page frames that are
dynamically managed for caching disk files and directories, and that support
virtual memory operations by providing pages for swapping and demand paging. It
also handles pages allocated for special files, such as IPC shared memory and
message queues. Application file I/O calls such as read and write cause the
underlying filesystem to perform the relevant operation on pages in the page
cache. Read operations on an unread file cause the requested file data to be
fetched from disk into pages of the page cache, and write operations update the
relevant file data in cached pages, which are then marked dirty and flushed to
disk at specific intervals.
Groups of pages in cache that contain data of a specific disk file are
represented through a descriptor of type struct address_space, so each
address_space instance serves as an abstraction for a set of pages owned by
either a file inode or block device file inode:
struct address_space {
    struct inode *host;                 /* owner: inode, block_device */
    struct radix_tree_root page_tree;   /* radix tree of all pages */
    spinlock_t tree_lock;               /* and lock protecting it */
    atomic_t i_mmap_writable;           /* count VM_SHARED mappings */
    struct rb_root i_mmap;              /* tree of private and shared mappings */
    struct rw_semaphore i_mmap_rwsem;   /* protect tree, count, list */
    /* Protected by tree_lock together with the radix tree */
    unsigned long nrpages;              /* number of total pages */
    /* number of shadow or DAX exceptional entries */
    unsigned long nrexceptional;
    pgoff_t writeback_index;            /* writeback starts here */
    const struct address_space_operations *a_ops; /* methods */
    unsigned long flags;                /* error bits */
    spinlock_t private_lock;            /* for use by the address_space */
    gfp_t gfp_mask;                     /* implicit gfp mask for allocations */
    struct list_head private_list;      /* ditto */
    void *private_data;                 /* ditto */
} __attribute__((aligned(sizeof(long))));
The *host pointer refers to the owner inode whose data is contained in the
pages represented by the current address_space object. For instance, if a page
in the cache contains data of a file managed by the Ext4 filesystem, the
corresponding VFS inode of the file stores the address_space object in its
i_data field. The nrpages field contains the count of pages under this
address_space.
For efficient management of file pages in cache, the VM subsystem needs to
track all virtual address mappings to regions of the same address_space; for
instance, a number of user-mode processes might map pages of a shared library
into their address space through vm_area_struct instances. The i_mmap field of
the address_space object is the root element of a red-black tree that contains
all vm_area_struct instances currently mapped to this address_space; since each
vm_area_struct instance refers back to the memory descriptor of the respective
process, it is always possible to track process references.
All physical pages containing file data under the address_space object are
organized through a radix tree for efficient access; the page_tree field is an
instance of struct radix_tree_root that serves as the root element for the
radix tree of pages. This structure is defined in the kernel header
<linux/radix-tree.h>:
struct radix_tree_root {
    gfp_t gfp_mask;
    struct radix_tree_node __rcu *rnode;
};
Each node of the radix tree is of type struct radix_tree_node; the *rnode
pointer of the previous structure refers to the first node element of the tree:
struct radix_tree_node {
    unsigned char shift;    /* Bits remaining in each slot */
    unsigned char offset;   /* Slot offset in parent */
    unsigned int count;
    union {
        struct {
            /* Used when ascending tree */
            struct radix_tree_node *parent;
            /* For tree user */
            void *private_data;
        };
        /* Used when freeing node */
        struct rcu_head rcu_head;
    };
    /* For tree user */
    struct list_head private_list;
    void __rcu *slots[RADIX_TREE_MAP_SIZE];
    unsigned long tags[RADIX_TREE_MAX_TAGS][RADIX_TREE_TAG_LONGS];
};
The offset field specifies the node's slot offset in the parent, count holds
the total count of child nodes, and *parent is a pointer to the parent node.
Each node can refer to 64 tree nodes (specified by the macro
RADIX_TREE_MAP_SIZE) through the slots array, where unused slot entries are
initialized with NULL.
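To make the role of the shift field and the 64-entry slots array concrete, the following sketch shows how a page index could be turned into a slot number at each level of such a tree. The macro and function names are hypothetical, and the sketch assumes 6 index bits per level (64 slots), as stated above.

```c
#include <assert.h>

#define MAP_SHIFT 6                     /* assumption: 6 index bits per level */
#define MAP_SIZE  (1UL << MAP_SHIFT)    /* 64 slots per node */
#define MAP_MASK  (MAP_SIZE - 1)

/*
 * Slot selected in a node whose shift field is `shift` while looking
 * up page index `index`. Leaf-level nodes have shift 0; every level
 * above adds MAP_SHIFT bits.
 */
static unsigned int slot_for(unsigned long index, unsigned int shift)
{
    assert(shift % MAP_SHIFT == 0);
    return (unsigned int)((index >> shift) & MAP_MASK);
}
```

For example, page index 4097 selects slot 1 in a node with shift 12, slot 0 in the node below it (shift 6), and slot 1 in the leaf-level node (shift 0).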
For efficient management of pages under an address space, it is important for
the memory manager to draw a clear distinction between clean and dirty pages;
this is made possible through tags assigned to the pages of each node of the
radix tree. The tagging information is stored in the tags field of the node
structure, which is a two-dimensional array. The first dimension of the array
distinguishes between the possible tags, and the second contains a sufficient
number of unsigned long elements so that there is a bit for each page that can
be organized in the node. Following is the list of tags supported:
/*
 * Radix-tree tags, for tagging dirty and writeback pages within
 * pagecache radix trees
 */
#define PAGECACHE_TAG_DIRTY     0
#define PAGECACHE_TAG_WRITEBACK 1
#define PAGECACHE_TAG_TOWRITE   2
The Linux radix tree API provides various operation interfaces to set, clear,
and get tags:
void *radix_tree_tag_set(struct radix_tree_root *root,
                         unsigned long index, unsigned int tag);
void *radix_tree_tag_clear(struct radix_tree_root *root,
                           unsigned long index, unsigned int tag);
int radix_tree_tag_get(struct radix_tree_root *root,
                       unsigned long index, unsigned int tag);
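The two-dimensional tags array described earlier can be modeled in a few lines of user-space C. This sketch covers a single node only; struct node_tags and the tag_*() helpers are hypothetical names, and the propagation of tags toward the root that the real radix tree API performs is not shown.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MAP_SIZE        64   /* slots per node */
#define MAX_TAGS        3    /* dirty, writeback, towrite */
#define BITS_PER_ULONG  (sizeof(unsigned long) * CHAR_BIT)
#define TAG_LONGS       ((MAP_SIZE + BITS_PER_ULONG - 1) / BITS_PER_ULONG)

enum { TAG_DIRTY = 0, TAG_WRITEBACK = 1, TAG_TOWRITE = 2 };

/* Hypothetical node that keeps only the per-slot tag bitmaps. */
struct node_tags {
    unsigned long tags[MAX_TAGS][TAG_LONGS];
};

/* Mark slot `slot` with tag `tag`. */
static void tag_set(struct node_tags *n, unsigned int tag, unsigned int slot)
{
    assert(tag < MAX_TAGS && slot < MAP_SIZE);
    n->tags[tag][slot / BITS_PER_ULONG] |= 1UL << (slot % BITS_PER_ULONG);
}

/* Remove tag `tag` from slot `slot`. */
static void tag_clear(struct node_tags *n, unsigned int tag, unsigned int slot)
{
    assert(tag < MAX_TAGS && slot < MAP_SIZE);
    n->tags[tag][slot / BITS_PER_ULONG] &= ~(1UL << (slot % BITS_PER_ULONG));
}

/* Report whether slot `slot` carries tag `tag`. */
static bool tag_get(const struct node_tags *n, unsigned int tag,
                    unsigned int slot)
{
    assert(tag < MAX_TAGS && slot < MAP_SIZE);
    return (n->tags[tag][slot / BITS_PER_ULONG] >>
            (slot % BITS_PER_ULONG)) & 1UL;
}
```

Keeping one bitmap per tag lets a scan for, say, dirty pages skip a whole node when the corresponding bitmap is zero.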
The following diagram depicts the layout of pages under the address_space
object:
Each address space object is bound to a set of functions that implement various
low-level operations between address space pages and the back-store block
device. The a_ops pointer of the address_space structure refers to the
descriptor containing address space operations. These operations are invoked by
VFS to initiate data transfers between pages in cache associated with an
address map and the back-store block device:
Page tables
All access operations on process virtual address regions are put through
address translation before reaching the appropriate physical memory regions.
The VM subsystem maintains page tables to translate linear page addresses into
physical addresses. Even though the page table layout is architecture specific,
for most architectures, the kernel uses a four-level paging structure, and we
will consider the x86-64 kernel page table layout for this discussion.
The following diagram depicts the layout of the page table for x86-64:
The address of the page global directory, which is the top-level page table, is
initialized into control register cr3. This is a 64-bit register with the
following bit break-up:
Bits     Description
2:0      Ignored
4:3      Page-level write-through and page-level cache disable
11:5     Reserved
51:12    Address of page global directory
63:52    Reserved
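The bit break-up in the table translates directly into masks. The following sketch decodes a cr3 value according to that table; the macro and function names are hypothetical, and the code works on a plain 64-bit integer rather than reading the register (which is only possible in privileged code).

```c
#include <assert.h>
#include <stdint.h>

#define CR3_PWT        (UINT64_C(1) << 3)               /* page-level write-through */
#define CR3_PCD        (UINT64_C(1) << 4)               /* page-level cache disable */
#define CR3_ADDR_MASK  UINT64_C(0x000FFFFFFFFFF000)     /* bits 51:12 */

/* Physical base address of the page global directory. */
static uint64_t cr3_pgd_base(uint64_t cr3)
{
    return cr3 & CR3_ADDR_MASK;
}

/* Nonzero when the write-through bit (bit 3) is set. */
static int cr3_write_through(uint64_t cr3)
{
    return (cr3 & CR3_PWT) != 0;
}

/* Nonzero when the cache-disable bit (bit 4) is set. */
static int cr3_cache_disable(uint64_t cr3)
{
    return (cr3 & CR3_PCD) != 0;
}
```

Because the low 12 bits are masked off, the page global directory is always aligned to a 4 KB boundary.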
Out of the 64-bit-wide linear addresses supported by x86-64, Linux currently
uses 48 bits, which enable 256 TB of linear address space; this is considered
large enough for current use. This 48-bit linear address is split into five
parts, with the lowest 12 bits containing the offset of the memory location in
the physical frame and the rest of the parts containing offsets into the
appropriate page table structures:
Linear address bits    Description
11:0 (12 bits)         Offset within the physical page
20:12 (9 bits)         Index into the page table
29:21 (9 bits)         Index into the page middle directory
38:30 (9 bits)         Index into the page upper directory
47:39 (9 bits)         Index into the page global directory
Each of the page table structures can support 512 records, of which each record
provides the base address of the next-level page structure. During translation
of a given linear address, the MMU extracts the top 9 bits (47:39) containing
the index into the page global directory (PGD), which is then added to the base
address of the PGD (found in cr3); this lookup results in the discovery of the
base address of the page upper directory (PUD). Next, the MMU retrieves the PUD
offset (9 bits) found in the linear address, and adds it to the base address of
the PUD structure to reach the PUD entry (PUDE) that yields the base address of
the page middle directory (PMD). The PMD offset found in the linear address is
then added to the base address of the PMD to reach the relevant PMD entry
(PMDE), which yields the base address of the page table. The page table offset
(9 bits) found in the linear address is then added to the base address
discovered from the PMD entry to reach the page table entry (PTE), which in
turn yields the start address of the physical frame of the requested data.
Finally, the page offset (12 bits) found in the linear address is added to the
base address discovered from the PTE to reach the memory location to be
accessed.
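The split of a 48-bit linear address into its five parts can be checked with a small user-space sketch. struct va_parts and split_linear_address() are hypothetical names; the shifts and masks follow the 9/9/9/9/12 layout described above.

```c
#include <assert.h>
#include <stdint.h>

#define IDX_MASK     UINT64_C(0x1ff)   /* 9 bits: 512 entries per table */
#define OFFSET_MASK  UINT64_C(0xfff)   /* 12 bits: 4 KB page */

/* Indexes used at each level of the walk, plus the in-page offset. */
struct va_parts {
    unsigned int pgd;      /* bits 47:39 */
    unsigned int pud;      /* bits 38:30 */
    unsigned int pmd;      /* bits 29:21 */
    unsigned int pte;      /* bits 20:12 */
    unsigned int offset;   /* bits 11:0  */
};

static struct va_parts split_linear_address(uint64_t va)
{
    struct va_parts p;

    p.pgd    = (unsigned int)((va >> 39) & IDX_MASK);
    p.pud    = (unsigned int)((va >> 30) & IDX_MASK);
    p.pmd    = (unsigned int)((va >> 21) & IDX_MASK);
    p.pte    = (unsigned int)((va >> 12) & IDX_MASK);
    p.offset = (unsigned int)(va & OFFSET_MASK);
    return p;
}
```

Four 9-bit indexes and a 12-bit offset add up to the 48 bits in use, and each index selects one of the 512 records in the table at its level.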
Summary
In this chapter, we focused on specifics of virtual memory management with
respect to process virtual address space and memory maps. We discussed critical
data structures of the VM subsystem, the memory descriptor structure (struct
mm_struct), and the VMA descriptor (struct vm_area_struct). We looked at the
page cache and its data structures (struct address_space) used in reverse
mapping of file buffers into various process address spaces. Finally, we
explored the page table layout of Linux, which is widely used in many
architectures. Having gained a thorough understanding of filesystems and
virtual memory management, in the next chapter, we will turn to kernel
synchronization and locking.
Kernel Synchronization and Locking
Kernel address space is shared by all user-mode processes, which enables
concurrent access to kernel services and data structures. For reliable
functioning of the system, it is imperative that kernel services be implemented
to be re-entrant. Kernel code paths accessing global data structures need to be
synchronized to ensure consistency and validity of shared data. In this
chapter, we will get into details of various resources at the disposal of
kernel programmers for synchronization of kernel code paths and protection of
shared data from concurrent access.
This chapter will cover the following topics:
Atomic operations
Spinlocks
Standard mutexes
Wait/wound mutex
Semaphores
Sequence locks
Completions
Atomic operations
A computation operation is considered to be atomic if it appears to the rest of
the system to occur instantaneously. Atomicity guarantees indivisible and
uninterruptible execution of the operation initiated. Most CPU instruction set
architectures define instruction opcodes that can perform atomic
read-modify-write operations on a memory location. These operations have a
succeed-or-fail definition, that is, they either successfully change the state
of the memory location or fail with no apparent effect. These operations are
handy for manipulating shared data atomically in a multi-threaded scenario.
They also serve as foundational building blocks for the implementation of
exclusion locks, which are engaged to protect shared memory locations from
concurrent access by parallel code paths.
Linux kernel code uses atomic operations for various use cases, such as
reference counters in shared data structures (which are used to track
concurrent access to various kernel data structures), wait-notify flags, and
for enabling exclusive ownership of data structures to a specific code path. To
ensure portability of kernel services that directly deal with atomic
operations, the kernel provides a rich library of architecture-neutral
interface macros and inline functions that serve as abstractions to
processor-dependent atomic instructions. The relevant CPU-specific atomic
instructions under these neutral interfaces are implemented by the architecture
branch of the kernel code.
typedef struct {
    int counter;
} atomic_t;

#ifdef CONFIG_64BIT
typedef struct {
    long counter;
} atomic64_t;
#endif
The implementation provides two groups of integer operations: one set
applicable to 32-bit and the other to 64-bit atomic variables. These interface
operations are implemented as a set of macros and inline functions. Following
is a summarized list of operations applicable to atomic_t type variables:
Interface macro/Inline function    Description
ATOMIC_INIT(i)                     Macro to initialize an atomic counter
atomic_read(v)                     Read value of the atomic counter v
atomic_set(v, i)                   Atomically set counter v to value specified in i
atomic_add(int i, atomic_t *v)     Atomically add i to counter v
atomic_sub(int i, atomic_t *v)     Atomically subtract i from counter v
atomic_inc(atomic_t *v)            Atomically increment counter v
atomic_dec(atomic_t *v)            Atomically decrement counter v
Following is a list of functions that perform relevant read-modify-write (RMW)
operations and return a result: either a Boolean test of the new value, the new
value itself, or (for the fetch, xchg, and cmpxchg variants) the value that was
stored before the modification:
Operation                                           Description
bool atomic_sub_and_test(int i, atomic_t *v)        Atomically subtracts i from v and returns true if the result is zero, or false otherwise
bool atomic_dec_and_test(atomic_t *v)               Atomically decrements v by 1 and returns true if the result is 0, or false for all other cases
bool atomic_inc_and_test(atomic_t *v)               Atomically increments v by 1 and returns true if the result is 0, or false for all other cases
bool atomic_add_negative(int i, atomic_t *v)        Atomically adds i to v and returns true if the result is negative, or false when the result is greater than or equal to zero
int atomic_add_return(int i, atomic_t *v)           Atomically adds i to v and returns the result
int atomic_sub_return(int i, atomic_t *v)           Atomically subtracts i from v and returns the result
int atomic_fetch_add(int i, atomic_t *v)            Atomically adds i to v and returns the pre-addition value at v
int atomic_fetch_sub(int i, atomic_t *v)            Atomically subtracts i from v and returns the pre-subtraction value at v
int atomic_cmpxchg(atomic_t *v, int old, int new)   Reads the value at location v and checks if it is equal to old; if true, swaps the value at v with new; always returns the value read at v
int atomic_xchg(atomic_t *v, int new)               Swaps the old value stored at location v with new, and returns the old value
For all of these operations, 64-bit variants exist for use with atomic64_t;
these functions follow the naming convention atomic64_*().
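A reference counter is the most common use of these interfaces. The following user-space sketch uses C11 <stdatomic.h> as a stand-in for the kernel API, since the kernel functions are not available outside the kernel; struct object and the object_*() helpers are hypothetical. object_put() plays the role of atomic_dec_and_test(), and object_get_unless_zero() shows the compare-and-exchange retry loop that atomic_cmpxchg() enables.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference-counted object. */
struct object {
    atomic_int refs;
};

static void object_init(struct object *o)
{
    atomic_init(&o->refs, 1);        /* the creator holds the first reference */
}

static void object_get(struct object *o)
{
    atomic_fetch_add(&o->refs, 1);   /* counterpart of atomic_inc() */
}

/* Returns true when the caller dropped the last reference. */
static bool object_put(struct object *o)
{
    /* fetch_sub returns the value held before the subtraction */
    return atomic_fetch_sub(&o->refs, 1) == 1;
}

/* Take a reference only if the count has not already reached zero. */
static bool object_get_unless_zero(struct object *o)
{
    int old = atomic_load(&o->refs);

    while (old != 0) {
        /* On failure, old is refreshed with the current value. */
        if (atomic_compare_exchange_weak(&o->refs, &old, old + 1))
            return true;
    }
    return false;
}
```

The caller that sees object_put() return true is the only one allowed to free the object, which is why the decrement and the test have to happen as one atomic step.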
Atomic bitwise operations
Kernel-provided generic atomic operation interfaces also include bitwise
operations. Unlike integer operations, which are implemented to operate on the
atomic(64)_t type, these bit operations can be applied to any memory location.
The arguments to these operations are the position of the bit, or bit number,
and a pointer with a valid address. Within a single unsigned long, bit
positions run from 0-31 on 32-bit machines and 0-63 on 64-bit machines; larger
bit numbers address bits in the subsequent words of a bitmap that starts at
addr. Following is a summarized list of bitwise operations available:
Operation interface                                             Description
set_bit(int nr, volatile unsigned long *addr)                   Atomically set the bit nr in the location starting from addr
clear_bit(int nr, volatile unsigned long *addr)                 Atomically clear the bit nr in the location starting from addr
change_bit(int nr, volatile unsigned long *addr)                Atomically flip the bit nr in the location starting from addr
int test_and_set_bit(int nr, volatile unsigned long *addr)      Atomically set the bit nr in the location starting from addr, and return the old value of the nr-th bit
int test_and_clear_bit(int nr, volatile unsigned long *addr)    Atomically clear the bit nr in the location starting from addr, and return the old value of the nr-th bit
int test_and_change_bit(int nr, volatile unsigned long *addr)   Atomically flip the bit nr in the location starting from addr, and return the old value of the nr-th bit
For all the operations with a return type, the value returned is the old state
of the bit that was read out of the memory address before the specified
modification happened. Non-atomic versions of these operations also exist,
named with a double-underscore prefix (for example, __set_bit()); they are
efficient and useful for bit manipulations initiated from code that already
runs inside a mutually exclusive critical block. These are declared in the
kernel header <asm-generic/bitops/non-atomic.h>.
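The following user-space sketch mimics the behavior of test_and_set_bit() and test_and_clear_bit() with C11 atomics. It is an approximation for illustration only: the bitmap_*() names are hypothetical, and atomic_ulong replaces the kernel's volatile unsigned long pointer. It shows how a bit number is split into a word index and a bit position, and how the old bit value is returned.

```c
#include <assert.h>
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Atomically set bit nr in the bitmap at addr; return its old value. */
static bool bitmap_test_and_set_bit(unsigned int nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << (nr % BITS_PER_WORD);
    unsigned long old = atomic_fetch_or(&addr[nr / BITS_PER_WORD], mask);

    return (old & mask) != 0;
}

/* Atomically clear bit nr in the bitmap at addr; return its old value. */
static bool bitmap_test_and_clear_bit(unsigned int nr, atomic_ulong *addr)
{
    unsigned long mask = 1UL << (nr % BITS_PER_WORD);
    unsigned long old = atomic_fetch_and(&addr[nr / BITS_PER_WORD], ~mask);

    return (old & mask) != 0;
}
```

A caller can use the returned old value to claim a resource exactly once: of several contexts racing to set the same bit, only the one that observes false has won.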
Introducing exclusion locks
Hardware-specific atomic instructions can operate only on CPU word- and
doubleword-size data; they cannot be directly applied to shared data structures
of custom size. In most multi-threaded scenarios, shared data is of custom
size, for example, a structure with n elements of various types. Concurrent
code paths accessing such data usually comprise a bunch of instructions that
are programmed to access and manipulate shared data; such access operations
must be executed atomically to prevent races. To ensure atomicity of such code
blocks, mutual exclusion locks are used. All multi-threading environments
provide implementations of exclusion locks that are based on exclusion
protocols. These locking implementations are built on top of hardware-specific
atomic instructions.
The Linux kernel implements operation interfaces for standard exclusion
mechanisms such as mutual and reader-writer exclusions. It also contains
support for various other contemporary lightweight and lock-free
synchronization mechanisms. Most kernel data structures and other shared data
elements such as shared buffers and device registers are protected from
concurrent access through appropriate exclusion-locking interfaces offered by
the kernel. In this section we will explore available exclusions and their
implementation details.
Spinlocks
Spinlocks are one of the simplest and most lightweight mutual exclusion
mechanisms, widely implemented by most concurrent programming environments. A
spinlock implementation defines a lock structure and operations that manipulate
the lock structure. The lock structure primarily hosts an atomic lock counter
among other elements, and the operation interfaces include:
An initializer routine, which initializes a spinlock instance to the default
(unlocked) state
A lock routine, which attempts to acquire the spinlock by altering the state of
the lock counter atomically
An unlock routine, which releases the spinlock by resetting the counter to the
unlocked state
When a caller context attempts to acquire a spinlock while it is locked (or
held by another context), the lock function iteratively polls or spins for the
lock until it becomes available, causing the caller context to hog the CPU
until the lock is acquired. It is due to this fact that this exclusion
mechanism is aptly named spinlock. It is therefore advised to ensure that code
within critical sections is atomic or non-blocking, so that the lock is held
for a short, deterministic duration; holding a spinlock for a long duration
could prove disastrous.
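The three routines listed above can be sketched in user space with a C11 atomic_flag. This is a teaching sketch under simplifying assumptions, not the kernel's spinlock: the sketch_spin_*() names are hypothetical, and there is no preemption or interrupt control.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical minimal spinlock: a single atomic flag. */
struct sketch_spinlock {
    atomic_flag locked;
};

/* Initializer routine: put the lock into the unlocked state. */
static void sketch_spin_init(struct sketch_spinlock *l)
{
    atomic_flag_clear(&l->locked);
}

/* One acquisition attempt; returns true if the lock was taken. */
static bool sketch_spin_trylock(struct sketch_spinlock *l)
{
    /* test_and_set returns the previous state of the flag */
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Lock routine: spin until an attempt succeeds. */
static void sketch_spin_lock(struct sketch_spinlock *l)
{
    while (!sketch_spin_trylock(l))
        ;   /* busy-wait: the caller hogs the CPU while contended */
}

/* Unlock routine: return the flag to the unlocked state. */
static void sketch_spin_unlock(struct sketch_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

The empty loop body in sketch_spin_lock() is exactly the behavior that makes long critical sections costly: every waiter burns CPU cycles for as long as the owner holds the lock.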
As discussed, spinlocks are built around processor-specific atomic operations;
the architecture branch of the kernel implements core spinlock operations
(assembly programmed). The kernel wraps the architecture-specific
implementation through a generic platform-neutral interface that is directly
usable by kernel services; this enables portability of the service code that
engages spinlocks for protection of shared resources.
Generic spinlock interfaces can be found in the kernel header
<linux/spinlock.h>, while architecture-specific definitions are part of
<asm/spinlock.h>. The generic interface provides a bunch of lock() and unlock()
operations, each implemented for a specific use case. We will discuss each of
these interfaces in the sections to follow; for now, let's begin our discussion
with the standard and most basic variants of lock() and unlock() operations
offered by the interface. The following code sample shows the usage of a basic
spinlock interface:
DEFINE_SPINLOCK(s_lock);
spin_lock(&s_lock);
/* critical region ... */
spin_unlock(&s_lock);
Let's examine the implementation of these functions under the hood:
static __always_inline void spin_lock(spinlock_t *lock)
{
    raw_spin_lock(&lock->rlock);
}
...
...
static __always_inline void spin_unlock(spinlock_t *lock)
{
    raw_spin_unlock(&lock->rlock);
}
Kernel code implements two variants of spinlock operations: one suitable for
SMP platforms and the other for uniprocessor platforms. The spinlock data
structure and operations related to the architecture and type of build (SMP and
UP) are defined in various headers of the kernel source tree. Let's familiarize
ourselves with the role and importance of these headers:
<linux/spinlock.h> contains generic spinlock/rwlock declarations.
The following headers are related to SMP platform builds:
<asm/spinlock_types.h> contains arch_spinlock_t/arch_rwlock_t and initializers
<linux/spinlock_types.h> defines the generic type and initializers
<asm/spinlock.h> contains the arch_spin_*() and similar low-level operation
implementations
<linux/spinlock_api_smp.h> contains the prototypes for the _spin_*() APIs
<linux/spinlock.h> builds the final spin_*() APIs
The following headers are related to uniprocessor (UP) platform builds:
<linux/spinlock_types_up.h> contains the generic, simplified UP spinlock type
<linux/spinlock_types.h> defines the generic type and initializers
<linux/spinlock_up.h> contains the arch_spin_*() and similar versions for UP
builds (which are NOPs on non-debug, non-preempt builds)
<linux/spinlock_api_up.h> builds the _spin_*() APIs
<linux/spinlock.h> builds the final spin_*() APIs
The generic kernel header <linux/spinlock.h> contains a conditional directive
to decide on the appropriate (SMP or UP) API to pull.
/*
 * Pull the _spin_*()/_read_*()/_write_*() functions/declarations:
 */
#if defined(CONFIG_SMP) || defined(CONFIG_DEBUG_SPINLOCK)
# include <linux/spinlock_api_smp.h>
#else
# include <linux/spinlock_api_up.h>
#endif
The raw_spin_lock() and raw_spin_unlock() macros expand at build time to the
appropriate version of spinlock operations, based on the type of platform (SMP
or UP) chosen in the build configuration. For SMP platforms, raw_spin_lock()
expands to the __raw_spin_lock() operation implemented in the kernel source
file kernel/locking/spinlock.c. Following is the locking operation code defined
with a macro:
/*
 * We build the __lock_function inlines here. They are too large for
 * inlining all over the place, but here is only one user per function
 * which embeds them into the calling _lock_function below.
 *
 * This could be a long-held lock. We both prepare to spin for a long
 * time (making _this_ CPU preemptable if possible), and we also signal
 * towards that other CPU that it should break the lock ASAP.
 */
#define BUILD_LOCK_OPS(op, locktype) \
void __lockfunc __raw_##op##_lock(locktype##_t *lock) \
{ \
    for (;;) { \
        preempt_disable(); \
        if (likely(do_raw_##op##_trylock(lock))) \
            break; \
        preempt_enable(); \
        \
        if (!(lock)->break_lock) \
            (lock)->break_lock = 1; \
        while (!raw_##op##_can_lock(lock) && (lock)->break_lock) \
            arch_##op##_relax(&lock->raw_lock); \
    } \
    (lock)->break_lock = 0; \
}
This routine is composed of nested loop constructs: an outer for loop and an
inner while loop that spins until the specified condition is satisfied. The
first block of code in the outer loop attempts to acquire the lock atomically
by invoking the do_raw_##op##_trylock() routine. Notice that this function is
invoked with kernel preemption disabled on the local processor. If the lock is
acquired successfully, it breaks out of the loop construct and the call returns
with preemption turned off. This ensures that the caller context holding the
lock is not preemptable during execution of a critical section. This approach
also ensures that no other context can contend for the same lock on the local
CPU until the current owner releases it.
However, if it fails to acquire the lock, preemption is enabled through the
preempt_enable() call, and the caller context enters the inner loop. This loop
is implemented through a conditional while that spins until the lock is found
to be available. Each iteration of the loop checks for the lock, and when it
detects that the lock is not available yet, it invokes an architecture-specific
relax routine (which typically executes a CPU-specific pause or no-op
instruction) before spinning again to check for the lock. Recall that during
this time preemption is enabled; this ensures that the caller context is
preemptable and does not hog the CPU for a long duration, which can happen
especially when the lock is highly contended. It also allows two or more
threads scheduled on the same CPU to contend for the same lock, possibly by
preempting each other.
When a spinning context detects through raw_spin_can_lock() that the lock is
available, it breaks out of the while loop, causing the caller to iterate back
to the beginning of the outer loop (for loop), where it again attempts to grab
the lock through do_raw_##op##_trylock() with preemption disabled. On
uniprocessor builds, the lock operation is defined as follows:
/*
 * In the UP-nondebug case there's no real locking going on, so the
 * only thing we have to do is to keep the preempt counts and irq
 * flags straight, to suppress compiler warnings of unused lock
 * variables, and to add the proper checker annotations:
 */
#define ___LOCK(lock) \
  do { __acquire(lock); (void)(lock); } while (0)
#define __LOCK(lock) \
  do { preempt_disable(); ___LOCK(lock); } while (0)
#define _raw_spin_lock(lock) __LOCK(lock)
Unlike the SMP variant, the spinlock implementation for UP platforms is quite
simple; in fact, the lock routine just disables kernel preemption and puts the
caller into a critical section. This works since there is no possibility of
another context contending for the lock while preemption is suspended.
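The shape of the SMP acquisition loop (one atomic attempt, then spinning on a plain read until the lock looks free, then retrying) can be sketched in user space as follows. This is only an analogy built on C11 atomics: the ttas_*() names are hypothetical, preemption control and the break_lock hint are omitted, and the empty spin body stands in for the architecture-specific relax routine.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical lock word: 0 = free, 1 = held. */
struct ttas_lock {
    atomic_int held;
};

/* Single atomic attempt to take the lock (the "trylock" step). */
static bool ttas_trylock(struct ttas_lock *l)
{
    int expected = 0;

    return atomic_compare_exchange_strong_explicit(&l->held, &expected, 1,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

/* Read-only check (the "can lock" step); causes no write traffic. */
static bool ttas_can_lock(struct ttas_lock *l)
{
    return atomic_load_explicit(&l->held, memory_order_relaxed) == 0;
}

static void ttas_lock(struct ttas_lock *l)
{
    for (;;) {
        if (ttas_trylock(l))
            break;                  /* acquired */
        while (!ttas_can_lock(l))
            ;                       /* the kernel relaxes the CPU here */
    }
}

static void ttas_unlock(struct ttas_lock *l)
{
    atomic_store_explicit(&l->held, 0, memory_order_release);
}
```

Spinning on a read rather than on the atomic attempt keeps waiters from repeatedly writing to the lock word, so the owner's release is not delayed by contention on that memory location.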
unsigned long __lockfunc __raw_##op##_lock_irqsave(locktype##_t *lock) \
{ \
    unsigned long flags; \
    \
    for (;;) { \
        preempt_disable(); \
        local_irq_save(flags); \
        if (likely(do_raw_##op##_trylock(lock))) \
            break; \
        local_irq_restore(flags); \
        preempt_enable(); \
        \
        if (!(lock)->break_lock) \
            (lock)->break_lock = 1; \
        while (!raw_##op##_can_lock(lock) && (lock)->break_lock) \
            arch_##op##_relax(&lock->raw_lock); \
    } \
    (lock)->break_lock = 0; \
    return flags; \
}
void __lockfunc __raw_##op##_lock_bh(locktype##_t *lock) \
{ \
    unsigned long flags; \
    \
    /* */ \
    /* Careful: we must exclude softirqs too, hence the */ \
    /* irq-disabling. We use the generic preemption-aware */ \
    /* function: */ \
    /**/ \
    flags = _raw_##op##_lock_irqsave(lock); \
    local_bh_disable(); \
    local_irq_restore(flags); \
}
local_bh_disable() suspends bottom half execution for the local CPU. To release
a lock acquired by spin_lock_bh(), the caller context will need to invoke
spin_unlock_bh(), which releases the spinlock and the BH lock for the local
CPU.
Following is a summarized list of the kernel spinlock API interface:
Function                    Description
spin_lock_init()            Initialize spinlock
spin_lock()                 Acquire lock, spins on contention
spin_trylock()              Attempt to acquire lock without spinning; returns nonzero on success, or 0 on contention
spin_lock_bh()              Acquire lock by suspending BH routines on the local processor, spins on contention
spin_lock_irqsave()         Acquire lock by suspending interrupts on the local processor after saving the current interrupt state, spins on contention
spin_lock_irq()             Acquire lock by suspending interrupts on the local processor, spins on contention
spin_unlock()               Release the lock
spin_unlock_bh()            Release lock and enable bottom halves for the local processor
spin_unlock_irqrestore()    Release lock and restore local interrupts to the previous state
spin_unlock_irq()           Release lock and enable interrupts for the local processor
spin_is_locked()            Return state of the lock: nonzero if the lock is held, or zero if the lock is available
typedef struct {
    arch_rwlock_t raw_lock;
#ifdef CONFIG_GENERIC_LOCKBREAK
    unsigned int break_lock;
#endif
#ifdef CONFIG_DEBUG_SPINLOCK
    unsigned int magic, owner_cpu;
    void *owner;
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
    struct lockdep_map dep_map;
#endif
} rwlock_t;
read_lock(&v_rwlock);
/* critical section with read only access to shared data */
read_unlock(&v_rwlock);

write_lock(&v_rwlock);
/* critical section for both read and write */
write_unlock(&v_rwlock);
Both read and write lock routines spin when the lock is contended. The
interface also offers non-spinning versions of the lock functions called
read_trylock() and write_trylock(). It also offers interrupt-disabling versions
of the locking calls, which are handy when either the read or write path
happens to execute in interrupt or bottom-half context.
Following is a summarized list of interface operations:
Function                     Description
read_lock()                  Standard read lock interface, spins on contention
read_trylock()               Attempts to acquire lock without spinning; returns nonzero on success, or 0 if the lock is unavailable
read_lock_bh()               Attempts to acquire lock by suspending BH execution for the local CPU, spins on contention
read_lock_irqsave()          Attempts to acquire lock by suspending interrupts for the current CPU after saving the current state of local interrupts, spins on contention
read_unlock()                Releases read lock
read_unlock_irqrestore()     Releases lock held and restores local interrupts to the previous state
read_unlock_bh()             Releases read lock and enables BH on the local processor
write_lock()                 Standard write lock interface, spins on contention
write_trylock()              Attempts to acquire lock without spinning; returns nonzero on success, or 0 on contention
write_lock_bh()              Attempts to acquire write lock by suspending bottom halves for the local CPU, spins on contention
write_lock_irqsave()         Attempts to acquire write lock by suspending interrupts for the local CPU after saving the current state of local interrupts, spins on contention
write_unlock()               Releases write lock
write_unlock_irqrestore()    Releases lock and restores local interrupts to the previous state
write_unlock_bh()            Releases write lock and enables BH on the local processor
Underlying calls for all of these operations are similar to those of the
spinlock implementation and can be found in the headers specified in the
aforementioned spinlock section.
Mutex locks
Spinlocks by design are better suited for scenarios where the lock is held for
short, fixed intervals of time, since busy-waiting for an indefinite duration
would have a dire impact on system performance. However, there are ample
situations where a lock is held for longer, non-deterministic durations;
sleeping locks are designed precisely for such situations. Kernel mutexes are
an implementation of sleeping locks: when a caller task attempts to acquire a
mutex that is unavailable (already owned by another context), it is put to
sleep and moved into a wait queue, forcing a context switch that allows the CPU
to run other productive tasks. When the mutex becomes available, the unlock
path of the mutex wakes up the task in the wait queue, which can then attempt
to lock the mutex.
Mutexes are represented by struct mutex, defined in include/linux/mutex.h, with
the corresponding operations implemented in the source file
kernel/locking/mutex.c:
struct mutex {
        atomic_long_t owner;
        spinlock_t wait_lock;
#ifdef CONFIG_MUTEX_SPIN_ON_OWNER
        struct optimistic_spin_queue osq; /* Spinner MCS lock */
#endif
        struct list_head wait_list;
#ifdef CONFIG_DEBUG_MUTEXES
        void *magic;
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        struct lockdep_map dep_map;
#endif
};
In its basic form, each mutex contains an atomic_long_t counter (owner), 64
bits wide on 64-bit platforms, which is used both for holding the lock state
and for storing a reference to the task structure of the task that currently
owns the lock. Each mutex also contains a wait queue (wait_list) and a
spinlock (wait_lock) that serializes access to wait_list.
The mutex API provides a set of macros and functions for initialization, lock,
unlock, and querying the status of a mutex. These operation interfaces are
declared in <include/linux/mutex.h>.
A mutex can be declared and initialized with the macro DEFINE_MUTEX(name).
There is also an option of initializing a valid mutex dynamically through
mutex_init(mutex).
As discussed earlier, on contention, lock operations put the caller thread to
sleep, which requires the caller thread to be put into the TASK_INTERRUPTIBLE,
TASK_UNINTERRUPTIBLE, or TASK_KILLABLE state before it is moved into the mutex
wait list. To support this, the mutex implementation offers variants of the
lock operation for uninterruptible, interruptible, and killable sleep.
Following is a list of standard mutex operations with a short description of
each:
/**
 * mutex_lock - acquire the mutex
 * @lock: the mutex to be acquired
 *
 * Lock the mutex exclusively for this task. If the mutex is not
 * available right now, put the caller into uninterruptible sleep until
 * the mutex is available.
 */
void mutex_lock(struct mutex *lock);
/**
 * mutex_lock_interruptible - acquire the mutex, interruptible
 * @lock: the mutex to be acquired
 *
 * Lock the mutex like mutex_lock() and return 0 if the mutex has been
 * acquired. Otherwise, put the caller into interruptible sleep until the
 * mutex is available. Return -EINTR if a signal arrives while sleeping
 * for the lock.
 */
int __must_check mutex_lock_interruptible(struct mutex *lock);
/**
 * mutex_lock_killable - acquire the mutex, killable
 * @lock: the mutex to be acquired
 *
 * Similar to mutex_lock_interruptible(), with the difference that the call
 * returns -EINTR only when a fatal KILL signal arrives while sleeping for
 * the lock.
 */
int __must_check mutex_lock_killable(struct mutex *lock);
/**
 * mutex_trylock - try to acquire the mutex, without waiting
 * @lock: the mutex to be acquired
 *
 * Try to acquire the mutex atomically. Returns 1 if the mutex
 * has been acquired successfully, and 0 on contention.
 */
int mutex_trylock(struct mutex *lock);
/**
 * atomic_dec_and_mutex_lock - return holding mutex if we dec to 0
 * @cnt: the atomic which we are to dec
 * @lock: the mutex to return holding if we dec to 0
 *
 * Return true and hold the lock if we dec to 0; return false otherwise.
 */
int atomic_dec_and_mutex_lock(atomic_t *cnt, struct mutex *lock);
/**
 * mutex_is_locked - is the mutex locked
 * @lock: the mutex to be queried
 *
 * Returns 1 if the mutex is locked, 0 if unlocked.
 */
static inline int mutex_is_locked(struct mutex *lock);
/**
 * mutex_unlock - release the mutex
 * @lock: the mutex to be released
 *
 * Unlock the mutex owned by the caller task.
 */
void mutex_unlock(struct mutex *lock);
Despite being possible blocking calls, mutex locking functions have been
greatly optimized for performance. They are programmed to engage fast path and
slow path approaches while attempting lock acquisition. Let's explore the code
under the hood of the locking calls to better understand the fast path and the
slow path. The following code excerpt is of the mutex_lock() routine from
<kernel/locking/mutex.c>:
void __sched mutex_lock(struct mutex *lock)
{
        might_sleep();
        if (!__mutex_trylock_fast(lock))
                __mutex_lock_slowpath(lock);
}
Lock acquisition is first attempted by invoking the non-blocking fast path
call __mutex_trylock_fast(). If it fails to acquire the lock due to
contention, it enters the slow path by invoking __mutex_lock_slowpath():
static __always_inline bool __mutex_trylock_fast(struct mutex *lock)
{
        unsigned long curr = (unsigned long)current;
        if (!atomic_long_cmpxchg_acquire(&lock->owner, 0UL, curr))
                return true;
        return false;
}
This function is programmed to acquire the lock atomically if it is available.
It invokes the atomic_long_cmpxchg_acquire() macro, which attempts to assign
the current thread as the owner of the mutex. The macro returns the previous
value of owner, so a return value of 0 means that the mutex was available and
the swap succeeded, in which case the function returns true. Should some other
thread own the mutex, the swap fails and the function returns false. On
failure, the caller thread enters the slow path routine.
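The compare-and-swap at the heart of the fast path can be modeled in user space with C11 atomics. The following is a minimal sketch under stated assumptions, not the kernel code: `model_mutex`, `model_trylock_fast()`, and `model_unlock_fast()` are hypothetical names, and a nonzero integer task ID stands in for the `current` task pointer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model: owner == 0 means "unlocked". */
struct model_mutex {
    atomic_ulong owner;
};

/* Try to install `task` as owner; succeeds only if the mutex is free. */
static bool model_trylock_fast(struct model_mutex *m, unsigned long task)
{
    unsigned long expected = 0UL;

    assert(task != 0UL);    /* 0 is reserved for the unlocked state */
    return atomic_compare_exchange_strong_explicit(&m->owner, &expected, task,
                                                   memory_order_acquire,
                                                   memory_order_relaxed);
}

/* Release the mutex, but only if `task` is the current owner. */
static bool model_unlock_fast(struct model_mutex *m, unsigned long task)
{
    unsigned long expected = task;

    return atomic_compare_exchange_strong_explicit(&m->owner, &expected, 0UL,
                                                   memory_order_release,
                                                   memory_order_relaxed);
}
```

In this model a failed `model_trylock_fast()` simply returns false; the point at which the kernel would fall back to the slow path is left to the caller.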
Conventionally, the concept of the slow path has always been to put the caller
task to sleep while it waits for the lock to become available. However, with
the advent of many-core CPUs, there is a growing need for scalability and
improved performance, so the mutex slow path implementation has been reworked
with an optimization called optimistic spinning, a.k.a. midpath, which can
improve performance considerably.
The core idea of optimistic spinning is to make contending tasks poll or spin
instead of sleep when the mutex owner is found to be running. Once the mutex
becomes available (which is expected to be soon, since the owner is found to
be running), it is assumed that a spinning task can acquire it more quickly
than a suspended or sleeping task in the mutex wait list. However, such
spinning is only a possibility when there are no other higher-priority tasks
in the ready state. With this feature, spinning tasks are more likely to be
cache-hot, resulting in deterministic execution that yields a noticeable
performance improvement:
static int __sched
__mutex_lock(struct mutex *lock, long state, unsigned int subclass,
             struct lockdep_map *nest_lock, unsigned long ip)
{
        return __mutex_lock_common(lock, state, subclass, nest_lock, ip, NULL, false);
}
...
...
...
static noinline void __sched __mutex_lock_slowpath(struct mutex *lock)
{
        __mutex_lock(lock, TASK_UNINTERRUPTIBLE, 0, NULL, _RET_IP_);
}
static noinline int __sched
__mutex_lock_killable_slowpath(struct mutex *lock)
{
        return __mutex_lock(lock, TASK_KILLABLE, 0, NULL, _RET_IP_);
}
static noinline int __sched
__mutex_lock_interruptible_slowpath(struct mutex *lock)
{
        return __mutex_lock(lock, TASK_INTERRUPTIBLE, 0, NULL, _RET_IP_);
}
The __mutex_lock_common() function contains the slow path implementation with
optimistic spinning; this routine is invoked by all sleep variants of the
mutex locking functions, with the appropriate task state as an argument. This
function first attempts to acquire the mutex through optimistic spinning,
implemented through the cancellable MCS spinlock (the osq field in the mutex
structure) associated with the mutex. When the caller task fails to acquire
the mutex with optimistic spinning, as a last resort this function switches to
the conventional slow path, in which the caller task is put to sleep and
queued into the mutex wait_list until it is woken up by the unlock path.
Debug checks and validations
Incorrect use of mutex operations can cause deadlocks, failure of exclusion,
and so on. To detect and prevent such occurrences, the mutex subsystem is
equipped with appropriate checks or validations instrumented into mutex
operations. These checks are disabled by default, and can be enabled by
choosing the configuration option CONFIG_DEBUG_MUTEXES=y during kernel build.
Following is a list of checks enforced by the instrumented debug code:
- A mutex can be owned by only one task at a given point in time
- A mutex can be released (unlocked) only by its valid owner; an attempt to
  release a mutex by a context that does not own the lock will fail
- Recursive locking or unlocking attempts will fail
- A mutex can only be initialized via the initializer call; initializing a
  mutex with memset() is not allowed
- A caller task may not exit with a mutex lock held
- Dynamic memory areas where held locks reside must not be freed
- A mutex can be initialized only once; any attempt to re-initialize an
  already initialized mutex will fail
- Mutexes may not be used in hard/soft interrupt context routines
Deadlocks can be triggered for many reasons, such as the execution pattern of
the kernel code and careless usage of locking calls. For instance, let's
consider a situation where concurrent code paths need to take ownership of
locks L1 and L2 by nesting the locking functions. It must be ensured that all
the kernel functions that require these locks are programmed to acquire them
in the same order. When such ordering is not strictly imposed, there is always
a possibility of two different functions trying to lock L1 and L2 in opposite
order, which could trigger a lock inversion deadlock when these functions
execute concurrently.
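One way to guarantee a consistent order for a pair of locks is to pick a fixed global rule that every path can compute. The sketch below orders the pair by address, which is an assumption made for illustration rather than something the text above prescribes. `toy_lock`, `lock_pair()`, and `unlock_pair()` are hypothetical user-space stand-ins, and the two locks passed in are assumed to be distinct.

```c
#include <assert.h>
#include <stdint.h>

/* Toy lock for illustration only; real kernel code would use a mutex or spinlock. */
struct toy_lock {
    int held;
    int acquired_seq;   /* position at which this lock was last taken */
};

static int toy_seq;     /* global acquisition counter for the demo */

static void toy_lock_acquire(struct toy_lock *l)
{
    assert(!l->held);
    l->held = 1;
    l->acquired_seq = ++toy_seq;
}

static void toy_lock_release(struct toy_lock *l)
{
    assert(l->held);
    l->held = 0;
}

/* Take both locks in one fixed order (lowest address first), whatever the argument order. */
static void lock_pair(struct toy_lock *a, struct toy_lock *b)
{
    if ((uintptr_t)a > (uintptr_t)b) {
        struct toy_lock *tmp = a;
        a = b;
        b = tmp;
    }
    toy_lock_acquire(a);
    toy_lock_acquire(b);
}

static void unlock_pair(struct toy_lock *a, struct toy_lock *b)
{
    toy_lock_release(a);
    toy_lock_release(b);
}
```

Because both `lock_pair(&L1, &L2)` and `lock_pair(&L2, &L1)` end up taking the locks in the same order, two paths that use this helper cannot invert the order against each other.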
The kernel lock validator infrastructure has been implemented to check and
prove that none of the locking patterns observed during kernel runtime could
ever cause a deadlock. This infrastructure prints data pertaining to locking
patterns, such as:
- Point-of-acquire tracking, symbolic lookup of function names, and a list of
  all locks held in the system
- Owner tracking
- Detection of self-recursing locks and printing out all relevant info
- Detection of lock inversion deadlocks and printing out all affected locks
  and tasks
The lock validator can be enabled by choosing CONFIG_PROVE_LOCKING=y during
kernel build.
Wait/wound mutexes
As discussed in the earlier section, unordered nested locking in kernel
functions could pose a risk of lock-inversion deadlocks, and kernel developers
avoid this by defining rules for nested lock ordering and by performing
runtime checks through the lock validator infrastructure. Yet there are
situations where lock ordering is dynamic, and nested locking calls cannot be
hardcoded or imposed as per preconceived rules.
One such use case has to do with GPU buffers; these buffers are owned and
accessed by various system entities such as the GPU hardware, the GPU driver,
user-mode applications, and other video-related drivers. User-mode contexts
can submit the DMA buffers for processing in an arbitrary order, and the GPU
hardware may process them at arbitrary times. If locking is used to control
the ownership of the buffers, and if multiple buffers must be manipulated at
the same time, deadlocks cannot be avoided with a fixed locking order.
Wait/wound mutexes are designed to facilitate dynamic ordering of nested
locks without causing lock-inversion deadlocks. This is achieved by wounding
the context in contention, that is, forcing it to release the locks it holds.
For instance, let's presume two buffers, each protected with a lock, and
further consider two threads, say T1 and T2, that seek ownership of the buffers
by attempting the locks in opposite order:
Thread T1        Thread T2
===========      ==========
lock(bufA);      lock(bufB);
lock(bufB);      lock(bufA);
 ....             ....
 ....             ....
unlock(bufB);    unlock(bufA);
unlock(bufA);    unlock(bufB);
Execution of T1 and T2 concurrently might result in each thread waiting for
the lock held by the other, causing a deadlock. The wait/wound mutex prevents
this by letting the thread that grabbed its lock first sleep while it waits
for the nested lock to become available. The other thread is wounded, causing
it to release the lock it holds and start over again. Suppose T1 got the lock
on bufA before T2 could acquire the lock on bufB. T1 would be considered the
thread that got there first and is put to sleep for the lock on bufB, and T2
would be wounded, causing it to release the lock on bufB and start all over.
This avoids the deadlock, and T2 starts over when T1 releases the locks it
holds.
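The back-off decision in the T1/T2 scenario can be sketched as a small single-threaded model. This is an illustration under assumptions, not the kernel's algorithm: `model_ctx`, `model_ww_lock`, and `model_ww_lock_try()` are hypothetical names, a lower `stamp` marks the context that started first, and `-EAGAIN` is used here only to mark the point where the older context would sleep.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model types; a lower stamp means the context started earlier. */
struct model_ctx {
    unsigned long stamp;
};

struct model_ww_lock {
    const struct model_ctx *holder;   /* NULL when the lock is free */
};

/*
 * Returns 0 if the lock was taken, -EALREADY if the caller already holds it,
 * -EDEADLK if the caller is the younger context and must back off, and
 * -EAGAIN if the caller is older and would sleep until the lock is released.
 */
static int model_ww_lock_try(struct model_ww_lock *lock, const struct model_ctx *ctx)
{
    assert(ctx != NULL);
    if (lock->holder == NULL) {
        lock->holder = ctx;
        return 0;
    }
    if (lock->holder == ctx)
        return -EALREADY;
    if (ctx->stamp > lock->holder->stamp)
        return -EDEADLK;    /* younger context is wounded */
    return -EAGAIN;         /* older context waits (modeled as "try again") */
}

static void model_ww_unlock(struct model_ww_lock *lock)
{
    lock->holder = NULL;
}
```

With T1 holding bufA and T2 holding bufB, T2's attempt on bufA yields `-EDEADLK`; once T2 drops bufB, T1 can take it, and T2 retries after T1 has released both locks.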
struct ww_mutex {
        struct mutex base;
        struct ww_acquire_ctx *ctx;
#ifdef CONFIG_DEBUG_MUTEXES
        struct ww_class *ww_class;
#endif
};

static DEFINE_WW_CLASS(bufclass);

struct ww_class {
        atomic_long_t stamp;
        struct lock_class_key acquire_key;
        struct lock_class_key mutex_key;
        const char *acquire_name;
        const char *mutex_name;
};
/**
 * ww_acquire_init - initialize a w/w acquire context
 * @ctx: w/w acquire context to initialize
 * @ww_class: w/w class of the context
 *
 * Initializes a context to acquire multiple mutexes of the given w/w class.
 *
 * Context-based w/w mutex acquiring can be done in any order whatsoever
 * within a given lock class. Deadlocks will be detected and handled with the
 * wait/wound logic.
 *
 * Mixing of context-based w/w mutex acquiring and single w/w mutex locking
 * can result in undetected deadlocks and is so forbidden. Mixing different
 * contexts for the same w/w class when acquiring mutexes can also result in
 * undetected deadlocks, and is hence also forbidden. Both types of abuse will
 * be caught by enabling CONFIG_PROVE_LOCKING.
 */
void ww_acquire_init(struct ww_acquire_ctx *ctx, struct ww_class *ww_class);
/**
 * ww_mutex_lock - acquire the w/w mutex
 * @lock: the mutex to be acquired
 * @ctx: w/w acquire context, or NULL to acquire only a single lock.
 *
 * Lock the w/w mutex exclusively for this task.
 *
 * Deadlocks within a given w/w class of locks are detected and handled with
 * the wait/wound algorithm. If the lock isn't immediately available this
 * function will either sleep until it is (wait case) or it selects the
 * current context for backing off by returning -EDEADLK (wound case). Trying
 * to acquire the same lock with the same context twice is also detected and
 * signalled by returning -EALREADY. Returns 0 if the mutex was successfully
 * acquired.
 *
 * In the wound case the caller must release all currently held w/w mutexes
 * for the given context and then wait for this contending lock to be
 * available by calling ww_mutex_lock_slow.
 *
 * The mutex must later on be released by the same task that acquired it. The
 * task may not exit without first unlocking the mutex. Also, kernel memory
 * where the mutex resides must not be freed with the mutex still locked. The
 * mutex must first be initialized (or statically defined) before it can be
 * locked. memset()-ing the mutex to 0 is not allowed. The mutex must be of
 * the same w/w lock class as was used to initialize the acquire context.
 *
 * A mutex acquired with this function must be released with ww_mutex_unlock.
 */
int ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx);

/**
 * ww_mutex_lock_interruptible - acquire the w/w mutex, interruptible
 * @lock: the mutex to be acquired
 * @ctx: w/w acquire context
 *
 */
int ww_mutex_lock_interruptible(struct ww_mutex *lock,
                                struct ww_acquire_ctx *ctx);
/**
 * ww_acquire_done - marks the end of the acquire phase
 * @ctx: the acquire context
 *
 * Marks the end of the acquire phase, any further w/w mutex lock calls using
 * this context are forbidden.
 *
 * Calling this function is optional, it is just useful to document w/w mutex
 * code and to clearly distinguish the acquire phase from actually using the
 * locked data structures.
 */
void ww_acquire_done(struct ww_acquire_ctx *ctx);
/**
 * ww_acquire_fini - releases a w/w acquire context
 * @ctx: the acquire context to free
 *
 * Releases a w/w acquire context. This must be called _after_ all acquired
 * w/w mutexes have been released with ww_mutex_unlock.
 */
void ww_acquire_fini(struct ww_acquire_ctx *ctx);
Semaphores
Until early versions of the 2.6 kernel releases, semaphores were the primary
form of sleeping locks. A typical semaphore implementation comprises a
counter, a wait queue, and a set of operations that can increment/decrement
the counter atomically.
When a semaphore is used to protect a shared resource, its counter is
initialized to a number greater than zero, which is considered to be the
unlocked state. A task seeking access to a shared resource begins by invoking
the decrement operation on the semaphore. This call checks the semaphore
counter; if it is found to be greater than zero, the counter is decremented
and the function returns success. However, if the counter is found to be zero,
the decrement operation puts the caller task to sleep until the counter is
found to have increased to a number greater than zero.
This simple design offers great flexibility, which allows semaphores to be
adapted and applied to different situations. For instance, for cases where a
resource needs to be accessible to a specific number of tasks at any point in
time, the semaphore count can be initialized to the number of tasks that
require access, say 10, which allows a maximum of 10 tasks access to the
shared resource at any time. For yet other cases, such as a number of tasks
that require mutually exclusive access to a shared resource, the semaphore
count can be initialized to 1, resulting in a maximum of one task accessing
the resource at any given point in time.
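The counter logic described above can be sketched as a single-threaded model. This is an illustration only: `model_sem` and its helpers are hypothetical names, there is no wait queue or locking, and instead of sleeping when the count is zero the decrement reports failure, following the 0-on-success, 1-on-failure convention of down_trylock().

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical model of the counter logic only (no wait queue, no locking). */
struct model_sem {
    unsigned int count;
};

static void model_sema_init(struct model_sem *s, unsigned int val)
{
    s->count = val;
}

/* Decrement if possible: returns 0 on success, 1 if the count is already zero. */
static int model_down_trylock(struct model_sem *s)
{
    if (s->count == 0)
        return 1;
    s->count--;
    return 0;
}

/* Increment the count; in this model it always succeeds. */
static void model_up(struct model_sem *s)
{
    assert(s->count < UINT_MAX);
    s->count++;
}
```

Initializing the model with a count of 2 admits two holders at once, while a count of 1 gives the mutual-exclusion behavior described above.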
The semaphore structure and its interface operations are declared in the
kernel header <include/linux/semaphore.h>:
struct semaphore {
        raw_spinlock_t lock;
        unsigned int count;
        struct list_head wait_list;
};
The spinlock (the lock field) protects count; that is, semaphore operations
(inc/dec) are programmed to acquire lock before manipulating count. wait_list
is used to queue tasks that sleep while they wait for the semaphore count to
increase beyond zero.
Semaphores can be declared and initialized to 1 through a macro:
DEFINE_SEMAPHORE(s).
A semaphore can also be initialized dynamically to any positive number through
the following:
void sema_init(struct semaphore *sem, int val)
Following is a list of operation interfaces with a brief description of each.
Routines with the naming convention down_xxx() attempt to decrement the
semaphore and are possible blocking calls (except down_trylock()), while the
routine up() increments the semaphore and always succeeds:
/**
 * down_interruptible - acquire the semaphore unless interrupted
 * @sem: the semaphore to be acquired
 *
 * Attempts to acquire the semaphore. If no more tasks are allowed to
 * acquire the semaphore, calling this function will put the task to sleep.
 * If the sleep is interrupted by a signal, this function will return -EINTR.
 * If the semaphore is successfully acquired, this function returns 0.
 */
int down_interruptible(struct semaphore *sem);
/**
 * down_killable - acquire the semaphore unless killed
 * @sem: the semaphore to be acquired
 *
 * Attempts to acquire the semaphore. If no more tasks are allowed to
 * acquire the semaphore, calling this function will put the task to sleep.
 * If the sleep is interrupted by a fatal signal, this function will return
 * -EINTR. If the semaphore is successfully acquired, this function returns
 * 0.
 */
int down_killable(struct semaphore *sem);
/**
 * down_trylock - try to acquire the semaphore, without waiting
 * @sem: the semaphore to be acquired
 *
 * Try to acquire the semaphore atomically. Returns 0 if the semaphore has
 * been acquired successfully or 1 if it cannot be acquired.
 */
int down_trylock(struct semaphore *sem);
/**
 * down_timeout - acquire the semaphore within a specified time
 * @sem: the semaphore to be acquired
 * @timeout: how long to wait before failing
 *
 * Attempts to acquire the semaphore. If no more tasks are allowed to
 * acquire the semaphore, calling this function will put the task to sleep.
 * If the semaphore is not released within the specified number of jiffies,
 * this function returns -ETIME. It returns 0 if the semaphore was acquired.
 */
int down_timeout(struct semaphore *sem, long timeout);
/**
 * up - release the semaphore
 * @sem: the semaphore to release
 *
 * Release the semaphore. Unlike mutexes, up() may be called from any
 * context and even by tasks which have never called down().
 */
void up(struct semaphore *sem);
Unlike the mutex implementation, semaphore operations do not support debug
checks or validations; this constraint is due to their inherently generic
design, which allows them to be used as exclusion locks, event notification
counters, and so on. Ever since mutexes made their way into the kernel
(2.6.16), semaphores are no longer the preferred choice for exclusion, the use
of semaphores as locks has reduced considerably, and for other purposes the
kernel has alternate interfaces. Most of the kernel code using semaphores has
been converted to mutexes, with a few minor exceptions. Yet semaphores still
exist and are likely to remain at least until all of the kernel code using
them is converted to mutexes or other suitable interfaces.
Reader-writer semaphores
This interface is an implementation of sleeping reader-writer exclusion, which
serves as an alternative to the spinning ones. Reader-writer semaphores are
represented by struct rw_semaphore, declared in the kernel header
<linux/rwsem.h>:
struct rw_semaphore {
        atomic_long_t count;
        struct list_head wait_list;
        raw_spinlock_t wait_lock;
#ifdef CONFIG_RWSEM_SPIN_ON_OWNER
        struct optimistic_spin_queue osq; /* spinner MCS lock */
        /*
         * Write owner. Used as a speculative check to see
         * if the owner is running on the cpu.
         */
        struct task_struct *owner;
#endif
#ifdef CONFIG_DEBUG_LOCK_ALLOC
        struct lockdep_map dep_map;
#endif
};
This structure closely resembles that of a mutex and is designed to support
optimistic spinning with osq; it also includes debug support through the
kernel's lockdep. count serves as an exclusion counter, which is set to 1,
allowing a maximum of one writer to own the lock at a point in time. This
works since mutual exclusion is only enforced between contending writers, and
any number of readers can concurrently share the read lock. wait_lock is a
spinlock that protects the semaphore wait_list.
An rw_semaphore can be instantiated and initialized statically through
DECLARE_RWSEM(name); alternatively, it can be dynamically initialized through
init_rwsem(sem).
As with rw-spinlocks, this interface too offers distinct routines for lock
acquisition in the reader and writer paths. Following is a list of interface
operations:
/* reader interfaces */
void down_read(struct rw_semaphore *sem);
void up_read(struct rw_semaphore *sem);
/* trylock for reading -- returns 1 if successful, 0 if contention */
int down_read_trylock(struct rw_semaphore *sem);
/* writer interfaces */
void down_write(struct rw_semaphore *sem);
int __must_check down_write_killable(struct rw_semaphore *sem);
/* trylock for writing -- returns 1 if successful, 0 if contention */
int down_write_trylock(struct rw_semaphore *sem);
void up_write(struct rw_semaphore *sem);
/* downgrade write lock to read lock */
void downgrade_write(struct rw_semaphore *sem);
/* check if rw-sem is currently locked */
int rwsem_is_locked(struct rw_semaphore *sem);
These operations are implemented in the source file <kernel/locking/rwsem.c>;
the code is quite self-explanatory and we will not discuss it any further.
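The reader/writer exclusion rules behind these interfaces can be illustrated with a small single-threaded model. It is a sketch under assumptions, not the kernel implementation: `model_rwsem` and its helpers are hypothetical names, only the trylock-style outcomes (1 on success, 0 on contention) are modeled, and sleeping and wait queues are left out.

```c
#include <assert.h>

/* Hypothetical model: tracks current holders only; a real rwsem also queues sleepers. */
struct model_rwsem {
    unsigned int readers;   /* number of readers holding the lock */
    int writer;             /* 1 while a writer holds the lock */
};

/* Readers may share the lock unless a writer holds it. */
static int model_down_read_trylock(struct model_rwsem *s)
{
    if (s->writer)
        return 0;
    s->readers++;
    return 1;
}

static void model_up_read(struct model_rwsem *s)
{
    assert(s->readers > 0);
    s->readers--;
}

/* A writer needs the lock entirely to itself. */
static int model_down_write_trylock(struct model_rwsem *s)
{
    if (s->writer || s->readers > 0)
        return 0;
    s->writer = 1;
    return 1;
}

static void model_up_write(struct model_rwsem *s)
{
    assert(s->writer);
    s->writer = 0;
}

/* The writer becomes a reader without ever leaving the lock unheld. */
static void model_downgrade_write(struct model_rwsem *s)
{
    assert(s->writer);
    s->writer = 0;
    s->readers = 1;
}
```

After a downgrade, further readers are admitted while writers stay excluded until every reader has released the lock.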
Sequence locks
Conventional reader-writer locks are designed with reader priority, and they
might cause a writer task to wait for a non-deterministic duration, which
might not be suitable for shared data with time-sensitive updates. This is
where a sequence lock comes in handy, as it aims at providing quick and
lock-free access to shared resources. Sequence locks work best when the
resource to be protected is small and simple, with write access being quick
and infrequent, as internally sequence locks fall back on the spinlock
primitive.
Sequence locks introduce a special counter that is incremented every time a
writer acquires a sequence lock along with a spinlock. After the writer
completes, it increments the counter again, releases the spinlock, and opens
access for other writers. For reads, there are two types of readers: sequence
readers and locking readers. The sequence reader checks the counter before it
enters the critical section and then checks it again at the end, without
blocking any writer. If the counter remains the same, it implies that no
writer had accessed the section during the read; but if the counter has
changed by the end of the section, it indicates that a writer had access,
which calls for the reader to re-read the critical section for updated data. A
locking reader, as the name implies, takes a lock and blocks other readers and
writers while it is in progress; it also waits when another locking reader or
writer is in progress.
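The sequence-reader protocol can be modeled in user space with C11 atomics. The following is a minimal sketch under assumptions rather than the kernel's seqlock code: `model_seq_pair` and its helpers are hypothetical names, writers are assumed to be serialized by some other lock (the role the spinlock plays in a seqlock), and default sequentially consistent atomics are used for simplicity.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical user-space model of a sequence counter guarding two values. */
struct model_seq_pair {
    atomic_uint seq;    /* even: no write in progress; odd: write in progress */
    atomic_int a;
    atomic_int b;
};

/* Writer side; callers are assumed to be serialized by some other lock. */
static void model_write_pair(struct model_seq_pair *p, int a, int b)
{
    assert((atomic_load(&p->seq) & 1u) == 0);   /* no other write in progress */
    atomic_fetch_add(&p->seq, 1);               /* counter becomes odd */
    atomic_store(&p->a, a);
    atomic_store(&p->b, b);
    atomic_fetch_add(&p->seq, 1);               /* counter becomes even again */
}

/* Sample the counter, waiting out any write that is in progress. */
static unsigned model_read_begin(struct model_seq_pair *p)
{
    unsigned s;

    while ((s = atomic_load(&p->seq)) & 1u)
        ;
    return s;
}

/* True if a writer ran since `start` was sampled, so the read must be redone. */
static bool model_read_retry(struct model_seq_pair *p, unsigned start)
{
    return atomic_load(&p->seq) != start;
}

/* Sequence reader: repeat the read until no writer interfered. */
static void model_read_pair(struct model_seq_pair *p, int *a, int *b)
{
    unsigned start;

    do {
        start = model_read_begin(p);
        *a = atomic_load(&p->a);
        *b = atomic_load(&p->b);
    } while (model_read_retry(p, start));
}
```

The reader never blocks the writer; it only pays the cost of repeating the read when the counter shows that a write overlapped with it.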
A sequence lock is represented by the following type:
typedef struct {
        struct seqcount seqcount;
        spinlock_t lock;
} seqlock_t;
We can initialize a sequence lock statically using the following macro:
#define DEFINE_SEQLOCK(x) \
        seqlock_t x = __SEQLOCK_UNLOCKED(x)
Actual initialization is done using __SEQLOCK_UNLOCKED(x), which is defined
here:
#define __SEQLOCK_UNLOCKED(lockname) \
        { \
                .seqcount = SEQCNT_ZERO(lockname), \
                .lock = __SPIN_LOCK_UNLOCKED(lockname) \
        }
To dynamically initialize a sequence lock, we need to use the seqlock_init
macro, which is defined as follows:
#define seqlock_init(x) \
        do { \
                seqcount_init(&(x)->seqcount); \
                spin_lock_init(&(x)->lock); \
        } while (0)
<spanclass="k">static</span><spanclass="kr">inline</span><span
class="kt">void</span><spanclass="nf">write_seqlock</span>
<spanclass="p">(</span><spanclass="n">seqlock_t</span><span
class="o">*</span><spanclass="n">sl</span><spanclass="p">)
</span>
<spanclass="p">{</span>
<spanclass="n">spin_lock</span><spanclass="p">(</span><span
class="o">&</span><spanclass="n">sl</span><spanclass="o">->
</span><spanclass="n">lock</span><spanclass="p">);</span>
<spanclass="n">write_seqcount_begin</span><spanclass="p">
(</span><spanclass="o">&</span><spanclass="n">sl</span><span
class="o">-></span><spanclass="n">seqcount</span><span
class="p">);</span>
<spanclass="p">}</span>
<spanclass="k">static</span><spanclass="kr">inline</span><span
class="kt">void</span><spanclass="nf">write_sequnlock</span>
<spanclass="p">(</span><spanclass="n">seqlock_t</span><span
class="o">*</span><spanclass="n">sl</span><spanclass="p">)
</span>
<spanclass="p">{</span>
<spanclass="n">write_seqcount_end</span><spanclass="p">
(</span><spanclass="o">&</span><spanclass="n">sl</span><span
class="o">-></span><spanclass="n">seqcount</span><span
class="p">);</span>
<spanclass="n">spin_unlock</span><spanclass="p">(</span>
<spanclass="o">&</span><spanclass="n">sl</span><span
class="o">-></span><spanclass="n">lock</span><spanclass="p">);
</span>
<spanclass="p">}</span>
<spanclass="k">static</span><spanclass="kr">inline</span><span
class="kt">void</span><spanclass="nf">write_seqlock_bh</span>
<spanclass="p">(</span><spanclass="n">seqlock_t</span><span
class="o">*</span><spanclass="n">sl</span><spanclass="p">)
</span>
<spanclass="p">{</span>
<spanclass="n">spin_lock_bh</span><spanclass="p">(</span>
<spanclass="o">&</span><spanclass="n">sl</span><span
class="o">-></span><spanclass="n">lock</span><spanclass="p">);
</span>
<spanclass="n">write_seqcount_begin</span><spanclass="p">
(</span><spanclass="o">&</span><spanclass="n">sl</span><span
class="o">-></span><spanclass="n">seqcount</span><span
class="p">);</span>
<spanclass="p">}</span>
<spanclass="k">static</span><spanclass="kr">inline</span><span
class="kt">void</span><span
class="nf">write_sequnlock_bh</span><spanclass="p">(</span>
<spanclass="n">seqlock_t</span><spanclass="o">*</span><span
class="n">sl</span><spanclass="p">)</span>
<spanclass="p">{</span>
<spanclass="n">write_seqcount_end</span><spanclass="p">
(</span><spanclass="o">&</span><spanclass="n">sl</span><span
class="o">-></span><spanclass="n">seqcount</span><span
class="p">);</span>
    spin_unlock_bh(&sl->lock);
}

static inline void write_seqlock_irq(seqlock_t *sl)
{
    spin_lock_irq(&sl->lock);
    write_seqcount_begin(&sl->seqcount);
}

static inline void write_sequnlock_irq(seqlock_t *sl)
{
    write_seqcount_end(&sl->seqcount);
    spin_unlock_irq(&sl->lock);
}

static inline unsigned long __write_seqlock_irqsave(seqlock_t *sl)
{
    unsigned long flags;

    spin_lock_irqsave(&sl->lock, flags);
    write_seqcount_begin(&sl->seqcount);
    return flags;
}

static inline unsigned read_seqbegin(const seqlock_t *sl)
{
    return read_seqcount_begin(&sl->seqcount);
}

static inline unsigned read_seqretry(const seqlock_t *sl, unsigned start)
{
    return read_seqcount_retry(&sl->seqcount, start);
}
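The read-side helpers are meant to be used as a pair: a reader takes a snapshot with read_seqbegin(), reads the protected data, and repeats the whole read if read_seqretry() reports that a writer interfered. The following is a minimal, single-threaded user-space sketch of that counting rule. All toy_ names are hypothetical and are not kernel APIs, and unlike the kernel helper, this model does not spin while a write is in progress; it simply reports that a retry is needed.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a sequence counter (hypothetical, not the kernel type). */
struct toy_seqcount {
    unsigned sequence; /* even: no write in progress, odd: write in progress */
};

/* Writer side: entering and leaving the critical section each bump the count. */
void toy_write_begin(struct toy_seqcount *s) { s->sequence++; }
void toy_write_end(struct toy_seqcount *s)   { s->sequence++; }

/* Reader side: take a snapshot before reading the protected data. */
unsigned toy_read_begin(const struct toy_seqcount *s)
{
    return s->sequence;
}

/*
 * The read must be repeated if the snapshot was taken while a write was in
 * progress (odd value) or if the counter has moved since the snapshot.
 */
bool toy_read_retry(const struct toy_seqcount *s, unsigned start)
{
    return (start & 1u) || s->sequence != start;
}
```

A read that overlaps no write sees the same even value before and after, so it is accepted on the first pass; any intervening write changes the counter and forces the reader to loop again.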
Completion locks
Completion locks are an efficient way to achieve code synchronization if you
need one or more threads of execution to wait for the completion of some event,
such as waiting for another process to reach a point or state. Completion locks
may be preferred over a semaphore for a couple of reasons. First, multiple
threads of execution can wait for a completion, and by using complete_all(),
they can all be released at once. This is far better than a semaphore waking up
multiple threads one at a time. Second, semaphores can lead to race conditions
if a waiting thread deallocates the synchronization object; this problem
doesn’t exist when using completion.
Completion can be used by including <linux/completion.h> and by creating a
variable of type struct completion, which is an opaque structure for maintaining
the state of completion. It uses a FIFO to queue the threads waiting for the
completion event:
struct completion {
    unsigned int done;
    wait_queue_head_t wait;
};
Completion basically consists of initializing the completion structure, waiting
through any of the variants of the wait_for_completion() call, and finally
signalling the completion through the complete() or complete_all() call. There
are also functions to check the state of a completion during its lifetime.
#define DECLARE_COMPLETION(work) \
    struct completion work = COMPLETION_INITIALIZER(work)

static inline void init_completion(struct completion *x)
{
    x->done = 0;
    init_waitqueue_head(&x->wait);
}

static inline void reinit_completion(struct completion *x)
{
    x->done = 0;
}

extern void wait_for_completion_io(struct completion *);
extern int wait_for_completion_interruptible(struct completion *x);
extern int wait_for_completion_killable(struct completion *x);
extern unsigned long wait_for_completion_timeout(struct completion *x,
                                                 unsigned long timeout);
extern unsigned long wait_for_completion_io_timeout(struct completion *x,
                                                    unsigned long timeout);
extern long wait_for_completion_interruptible_timeout(
    struct completion *x, unsigned long timeout);
extern long wait_for_completion_killable_timeout(
    struct completion *x, unsigned long timeout);
extern bool try_wait_for_completion(struct completion *x);
extern bool completion_done(struct completion *x);

extern void complete(struct completion *);
extern void complete_all(struct completion *);
void complete(struct completion *x)
{
    unsigned long flags;

    spin_lock_irqsave(&x->wait.lock, flags);
    if (x->done != UINT_MAX)
        x->done++;
    __wake_up_locked(&x->wait, TASK_NORMAL, 1);
    spin_unlock_irqrestore(&x->wait.lock, flags);
}
EXPORT_SYMBOL(complete);

void complete_all(struct completion *x)
{
    unsigned long flags;

    spin_lock_irqsave(&x->wait.lock, flags);
    x->done = UINT_MAX;
    __wake_up_locked(&x->wait, TASK_NORMAL, 0);
    spin_unlock_irqrestore(&x->wait.lock, flags);
}
EXPORT_SYMBOL(complete_all);
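The difference between complete() and complete_all() comes down to how each treats the done counter. The sketch below is a user-space model of that counter alone, assuming the behaviour shown in the listing above; the wait queue, the spinlock, and any sleeping are deliberately left out. The toy_ names are hypothetical, and toy_try_wait() mirrors what we understand try_wait_for_completion() to do in kernels of this era (consume one event unless the counter is pinned at UINT_MAX).

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Toy model of the 'done' counter only (hypothetical, not the kernel type). */
struct toy_completion {
    unsigned int done;
};

void toy_init_completion(struct toy_completion *x) { x->done = 0; }

/* Like complete(): record one event unless the counter is pinned at UINT_MAX. */
void toy_complete(struct toy_completion *x)
{
    if (x->done != UINT_MAX)
        x->done++;
}

/* Like complete_all(): pin the counter so every waiter, present or future, passes. */
void toy_complete_all(struct toy_completion *x) { x->done = UINT_MAX; }

/* Non-sleeping wait: consume one event if available, report whether it succeeded. */
bool toy_try_wait(struct toy_completion *x)
{
    if (x->done == 0)
        return false;
    if (x->done != UINT_MAX)
        x->done--;
    return true;
}
```

Each toy_complete() lets exactly one waiter through, whereas after toy_complete_all() the counter is never decremented again, which is why a completion signalled this way has to be reset with reinit_completion() before it can be reused.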
Summary
Throughout this chapter, we not only covered the various protection and
synchronization mechanisms provided by the kernel, but also made an
attempt at appreciating the effectiveness of these options, with their
varied functionalities and shortcomings. Our takeaway from this chapter has to
be the tenacity with which the kernel addresses these varying complexities in
providing protection and synchronization of data. Another notable fact is
the way the kernel maintains ease of coding along with design panache when
tackling these issues.
In our next chapter, we will look at another crucial aspect: how interrupts are
handled by the kernel.
Interrupts and Deferred Work
An interrupt is an electrical signal delivered to the processor indicating the
occurrence of a significant event that needs immediate attention. These signals
can originate either from external hardware (connected to the system) or from
circuits within the processor. In this chapter we will look into the kernel's
interrupt management subsystem and explore the following:
Programmable interrupt controllers
Interrupt vector table
IRQs
IRQ chip and IRQ descriptors
Registering and unregistering interrupt handlers
IRQ line-control operations
IRQ stacks
Need for deferred routines
Softirqs
Tasklets
Workqueues
Interrupt signals and vectors
When an interrupt originates from an external device, it is referred to as a
hardware interrupt. These signals are generated by external hardware to seek
the attention of the processor on occurrence of a significant external event.
For instance, a key hit on the keyboard, a click on a mouse button, or moving
the mouse triggers a hardware interrupt through which the processor is notified
about the availability of data to be read. Hardware interrupts occur
asynchronously with respect to the processor clock (meaning they can occur at
random times), and hence are also termed asynchronous interrupts.
Interrupts triggered from within the CPU due to events generated by program
instructions currently in execution are referred to as software interrupts. A
software interrupt is caused either by an exception triggered by program
instructions currently in execution or by execution of a special instruction
that raises an interrupt. For instance, when a program instruction attempts to
divide a number by zero, the arithmetic logic unit of the processor raises an
interrupt called a divide-by-zero exception. Similarly, when a program in
execution intends to invoke a kernel service call, it executes a special
instruction (such as int 0x80 on 32-bit x86) that raises an interrupt to shift
the processor into privileged mode, which paves the path for the execution of
the desired service call. The faster sysenter and syscall instructions achieve
the same transition without going through the vector table. These events
occur synchronously with respect to the processor's clock and hence are also
called synchronous interrupts.
In response to the occurrence of an interrupt event, CPUs are designed to
preempt the current instruction sequence or thread of execution, and execute a
special function called an interrupt service routine (ISR). To locate the
appropriate ISR that corresponds to an interrupt event, interrupt vector tables
are used. An interrupt vector is an address in memory that contains a reference
to a software-defined interrupt service to be executed in response to an
interrupt. Processor architectures define the total count of interrupt vectors
supported, and describe the layout of each interrupt vector in memory. In
general, for most processor architectures, all supported vectors are set up in
memory as a list called an interrupt vector table, whose address is programmed
into a processor register by the platform software.
Let's consider specifics of the x86 architecture as an example for better
understanding. The x86 family of processors supports a total of 256 interrupt
vectors, of which the first 32 are reserved for processor exceptions and the
rest are used for software and hardware interrupts. The implementation of a
vector table by x86 is referred to as an interrupt descriptor table (IDT), which
is an array of descriptors of either 8 bytes (for 32-bit machines) or 16 bytes
(for 64-bit x86 machines) each. During early boot, the architecture-specific
branch of the kernel code sets up the IDT in memory and programs the IDTR
register (a special x86 register) of the processor with the linear start address
and length of the IDT.
When an interrupt occurs, the processor locates the relevant vector descriptor
by multiplying the reported vector number by the size of the vector descriptor
(vector number x 8 on x86_32 machines, and vector number x 16 on x86_64
machines) and adding the result to the base address of the IDT. Once a valid
vector descriptor is reached, the processor continues with the execution of
actions specified within the descriptor.
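The descriptor lookup is plain address arithmetic, which a few lines of C can make concrete. This is an illustrative sketch only: idt_entry_address() and the enum names are hypothetical helpers, and the base address used in the note that follows is made up.

```c
#include <assert.h>
#include <stdint.h>

/* Descriptor sizes stated in the text: 8 bytes on x86_32, 16 bytes on x86_64. */
enum { IDT_ENTRY_SIZE_32 = 8, IDT_ENTRY_SIZE_64 = 16 };

/* Address of the descriptor for 'vector': IDT base + vector * descriptor size. */
uint64_t idt_entry_address(uint64_t idt_base, uint8_t vector, unsigned entry_size)
{
    return idt_base + (uint64_t)vector * entry_size;
}
```

With a hypothetical IDT base of 0x1000, the descriptor for vector 14 (the page-fault exception on x86) sits at 0x1070 on a 32-bit machine and at 0x10E0 on a 64-bit machine.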
On x86 platforms, each vector descriptor implements a gate
(interrupt, task, or trap), which is used to transfer control of
execution across segments. Vector descriptors representing
hardware interrupts implement an interrupt gate, which refers to
the segment selector and offset of the interrupt handler
code. An interrupt gate disables all maskable interrupts
before passing control to the specified interrupt handler. Vector
descriptors representing exceptions and software interrupts
implement a trap gate, which also refers to the location of code
designated as a handler for the event. Unlike an interrupt gate, a
trap gate does not disable maskable interrupts, which makes it
suitable for execution of software interrupt handlers.
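To see how little separates an interrupt gate from a trap gate, it helps to look at the descriptor layout. The sketch below models the 16-byte gate descriptor used in 64-bit mode as we understand it from the architecture manuals; the struct and helper names are hypothetical and this is not the kernel's own definition. The two gate kinds differ only in the 4-bit type field (0xE for an interrupt gate, 0xF for a trap gate).

```c
#include <assert.h>
#include <stdint.h>

/* Layout of a 64-bit mode IDT gate descriptor (16 bytes, naturally aligned). */
struct idt_gate64 {
    uint16_t offset_low;  /* handler address bits 0..15 */
    uint16_t selector;    /* code segment selector */
    uint8_t  ist;         /* bits 0..2: interrupt stack table index */
    uint8_t  type_attr;   /* bit 7: present, bits 5..6: DPL, bits 0..3: gate type */
    uint16_t offset_mid;  /* handler address bits 16..31 */
    uint32_t offset_high; /* handler address bits 32..63 */
    uint32_t reserved;
};

_Static_assert(sizeof(struct idt_gate64) == 16, "gate descriptor must be 16 bytes");

enum { GATE_TYPE_INTERRUPT = 0xE, GATE_TYPE_TRAP = 0xF, GATE_PRESENT = 0x80 };

/* Build a present, DPL 0 gate of the requested type for 'handler'. */
struct idt_gate64 make_gate(uint64_t handler, uint16_t selector, uint8_t type)
{
    struct idt_gate64 g = {0};

    g.offset_low  = (uint16_t)(handler & 0xFFFF);
    g.selector    = selector;
    g.type_attr   = (uint8_t)(GATE_PRESENT | (type & 0x0F));
    g.offset_mid  = (uint16_t)((handler >> 16) & 0xFFFF);
    g.offset_high = (uint32_t)(handler >> 32);
    return g;
}

/* Reassemble the handler address that the processor would jump to. */
uint64_t gate_handler(const struct idt_gate64 *g)
{
    return (uint64_t)g->offset_low |
           ((uint64_t)g->offset_mid << 16) |
           ((uint64_t)g->offset_high << 32);
}
```

The handler address is scattered over three fields for historical reasons, so code that fills in an IDT entry has to split it exactly as make_gate() does.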
Programmable interrupt controller
Now let's focus on external interrupts and explore how processors identify the
occurrence of an external hardware interrupt, and how they discover the vector
number associated with the interrupt. CPUs are designed with a dedicated input
pin (intr pin) used to signal external interrupts. Each external hardware device
capable of issuing interrupt requests usually has one or more output pins
called interrupt request (IRQ) lines, used to signal an interrupt request to the
CPU. Computing platforms typically use a hardware circuit called a programmable
interrupt controller (PIC) to multiplex the CPU's interrupt pin across various
interrupt request lines. All of the existing IRQ lines originating from on-board
device controllers are routed to input pins of the interrupt controller, which
monitors each IRQ line for an interrupt signal, and upon arrival of an
interrupt, converts the request into a CPU-understandable vector number and
relays the interrupt signal on to the CPU's interrupt pin. In simple words, a
programmable interrupt controller multiplexes multiple device interrupt request
lines into a single interrupt line of the processor.
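The multiplexing role of an interrupt controller can be sketched with a small user-space model. This is a toy under stated assumptions, not a model of any real chip: it has 16 lines, gives priority to the lowest-numbered line, and maps line n to vector base + n. All names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PIC_LINES 16

/* Hypothetical controller state. */
struct toy_pic {
    uint8_t  vector_base; /* vector reported for IRQ line 0 */
    uint16_t pending;     /* one bit per IRQ line with a request outstanding */
    uint16_t masked;      /* one bit per IRQ line that is currently masked */
};

/* A device asserts its IRQ line. */
void toy_pic_raise(struct toy_pic *pic, unsigned line)
{
    if (line < TOY_PIC_LINES)
        pic->pending |= (uint16_t)(1u << line);
}

/*
 * Pick the lowest-numbered pending, unmasked line, clear its request, and
 * return the vector number handed to the CPU, or -1 if nothing is ready.
 */
int toy_pic_next_vector(struct toy_pic *pic)
{
    uint16_t ready = pic->pending & (uint16_t)~pic->masked;

    for (unsigned line = 0; line < TOY_PIC_LINES; line++) {
        if (ready & (1u << line)) {
            pic->pending &= (uint16_t)~(1u << line);
            return pic->vector_base + (int)line;
        }
    }
    return -1;
}
```

With vector_base set to 32 (the first vector past the reserved exception range), requests on lines 4 and 1 are delivered as vectors 33 and then 36, while a request on a masked line stays pending and is not delivered.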
Design and implementation of interrupt controllers is platform specific. Intel
x86 multiprocessor platforms use the Advanced Programmable Interrupt Controller
(APIC). The APIC design splits interrupt controller functionality into two
distinct components. The first is an I/O APIC that resides on the system
bus. All shared peripheral hardware IRQ lines are routed to the I/O APIC; this
chip translates an interrupt request into a vector code. The second is a per-CPU
controller called the Local APIC (usually integrated into the processor core),
which delivers hardware interrupts to specific CPU cores. The I/O APIC routes
interrupt events to the Local APIC of the chosen CPU core. It is programmed with
a redirection table, which is used for making interrupt routing decisions. CPU
Local APICs manage all external interrupts for a specific CPU core;
additionally, they deliver events from CPU-local hardware such as timers and
can also receive and generate inter-processor interrupts (IPIs) that can occur
on an SMP platform.
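The redirection table can be pictured as an array indexed by I/O APIC input pin, where each entry names a vector and a destination CPU. The following user-space sketch illustrates that routing decision only; the field names, the pin count of 24, and the four-CPU limit are assumptions made for the example and do not reproduce the hardware register format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_IOAPIC_PINS 24 /* illustrative pin count */
#define TOY_MAX_CPUS    4  /* illustrative CPU count */

/* One hypothetical redirection-table entry. */
struct toy_redir_entry {
    uint8_t vector;   /* vector delivered to the CPU */
    uint8_t dest_cpu; /* index of the target Local APIC */
    bool    masked;   /* true: requests on this pin are not forwarded */
};

struct toy_ioapic {
    struct toy_redir_entry redir[TOY_IOAPIC_PINS];
};

/* Per-CPU record of what its Local APIC has been handed. */
struct toy_lapic {
    uint8_t  last_vector;
    unsigned delivered;
};

/*
 * Route a request on 'pin' through the redirection table to the chosen
 * CPU's Local APIC. Returns false if the pin is invalid, masked, or
 * targets a CPU that does not exist.
 */
bool toy_ioapic_route(const struct toy_ioapic *io,
                      struct toy_lapic lapics[TOY_MAX_CPUS], unsigned pin)
{
    if (pin >= TOY_IOAPIC_PINS)
        return false;

    const struct toy_redir_entry *e = &io->redir[pin];

    if (e->masked || e->dest_cpu >= TOY_MAX_CPUS)
        return false;

    lapics[e->dest_cpu].last_vector = e->vector;
    lapics[e->dest_cpu].delivered++;
    return true;
}
```

Changing the dest_cpu field of an entry is, in spirit, what setting an interrupt's CPU affinity amounts to: the device and its vector stay the same, and only the Local APIC that receives the event changes.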
The following diagram depicts the split architecture of APIC. The flow of
events now begins with individual devices raising an IRQ on the I/O APIC, which
routes the request to a specific Local APIC, which in turn delivers the
interrupt to a specific CPU core:
Similar to the APIC architecture, multicore ARM platforms split the generic
interrupt controller (GIC) implementation into two. The first component is
called a distributor, which is global to the system and has several peripheral
hardware interrupt sources physically routed to it. The second component is
replicated per CPU and is called the CPU interface. The distributor component is
programmed with distribution logic of shared peripheral interrupts (SPI) to
known CPU interfaces.
struct irq_chip {
    struct device *parent_device;
    const char *name;
    unsigned int (*irq_startup)(struct irq_data *data);
    void (*irq_shutdown)(struct irq_data *data);
    void (*irq_enable)(struct irq_data *data);
    void (*irq_disable)(struct irq_data *data);

    void (*irq_ack)(struct irq_data *data);
    void (*irq_mask)(struct irq_data *data);
    void (*irq_mask_ack)(struct irq_data *data);
    void (*irq_unmask)(struct irq_data *data);
    void (*irq_eoi)(struct irq_data *data);

    int (*irq_set_affinity)(struct irq_data *data,
                            const struct cpumask *dest, bool force);

    int (*irq_retrigger)(struct irq_data *data);
    int (*irq_set_type)(struct irq_data *data, unsigned int flow_type);
    int (*irq_set_wake)(struct irq_data *data, unsigned int on);

    void (*irq_bus_lock)(struct irq_data *data);
    void (*irq_bus_sync_unlock)(struct irq_data *data);

    void (*irq_cpu_online)(struct irq_data *data);
    void (*irq_cpu_offline)(struct irq_data *data);

    void (*irq_suspend)(struct irq_data *data);
    void (*irq_resume)(struct irq_data *data);
    void (*irq_pm_shutdown)(struct irq_data *data);

    void (*irq_calc_mask)(struct irq_data *data);

    void (*irq_print_chip)(struct irq_data *data, struct seq_file *p);
    int (*irq_request_resources)(struct irq_data *data);
    void (*irq_release_resources)(struct irq_data *data);

    void (*irq_compose_msi_msg)(struct irq_data *data, struct msi_msg *msg);
    void (*irq_write_msi_msg)(struct irq_data *data, struct msi_msg *msg);

    int (*irq_get_irqchip_state)(struct irq_data *data,
                                 enum irqchip_irq_state which, bool *state);
    int (*irq_set_irqchip_state)(struct irq_data *data,
                                 enum irqchip_irq_state which, bool state);

    int (*irq_set_vcpu_affinity)(struct irq_data *data, void *vcpu_info);

    void (*ipi_send_single)(struct irq_data *data, unsigned int cpu);
    void (*ipi_send_mask)(struct irq_data *data, const struct cpumask *dest);

    unsigned long flags;
};
static struct irq_chip ioapic_chip __read_mostly = {
    .name             = "IO-APIC",
    .irq_startup      = startup_ioapic_irq,
    .irq_mask         = mask_ioapic_irq,
    .irq_unmask       = unmask_ioapic_irq,
    .irq_ack          = irq_chip_ack_parent,
    .irq_eoi          = ioapic_ack_level,
    .irq_set_affinity = ioapic_set_affinity,
    .irq_retrigger    = irq_chip_retrigger_hierarchy,
    .flags            = IRQCHIP_SKIP_SET_WAKE,
};

static struct irq_chip lapic_chip __read_mostly = {
    .name       = "local-APIC",
    .irq_mask   = mask_lapic_irq,
    .irq_unmask = unmask_lapic_irq,
    .irq_ack    = ack_lapic_irq,
};
/**
 * struct irq_data - per irq chip data passed down to chip functions
 * @mask:        precomputed bitmask for accessing the chip registers
 * @irq:         interrupt number
 * @hwirq:       hardware interrupt number, local to the interrupt domain
 * @common:      point to data shared by all irqchips
 * @chip:        low level interrupt hardware access
 * @domain:      Interrupt translation domain; responsible for mapping
 *               between hwirq number and linux irq number.
 * @parent_data: pointer to parent struct irq_data to support hierarchy
 *               irq_domain
 * @chip_data:   platform-specific per-chip private data for the chip
 *               methods, to allow shared chip implementations
 */
struct irq_data {
    u32 mask;
    unsigned int irq;
    unsigned long hwirq;
    struct irq_common_data *common;
    struct irq_chip *chip;
    struct irq_domain *domain;
#ifdef CONFIG_IRQ_DOMAIN_HIERARCHY
    struct irq_data *parent_data;
#endif
    void *chip_data;
};
/**
 * struct irqaction - per interrupt action descriptor
 * @handler:       interrupt handler function
 * @name:          name of the device
 * @dev_id:        cookie to identify the device
 * @percpu_dev_id: cookie to identify the device
 * @next:          pointer to the next irqaction for shared interrupts
 * @irq:           interrupt number
 * @flags:         flags
 * @thread_fn:     interrupt handler function for threaded interrupts
 * @thread:        thread pointer for threaded interrupts
 * @secondary:     pointer to secondary irqaction (force threading)
 * @thread_flags:  flags related to @thread
 * @thread_mask:   bitmask for keeping track of @thread activity
 * @dir:           pointer to the proc/irq/NN/name entry
 */
struct irqaction {
    irq_handler_t handler;
    void *dev_id;
    void __percpu *percpu_dev_id;
    struct irqaction *next;
    irq_handler_t thread_fn;
    struct task_struct *thread;
    struct irqaction *secondary;
    unsigned int irq;
    unsigned int flags;
    unsigned long thread_flags;
    unsigned long thread_mask;
    const char *name;
    struct proc_dir_entry *dir;
};
High-level interrupt-management interfaces
The generic IRQ layer provides a set of function interfaces for device drivers
to grab IRQ descriptors and bind interrupt handlers, release IRQs, enable or
disable interrupt lines, and so on. We will explore all of the generic
interfaces in this section.
Registering an interrupt handler
typedef irqreturn_t (*irq_handler_t)(int, void *);

/**
 * request_irq - allocate an interrupt line
 * @irq: Interrupt line to allocate
 * @handler: Function to be called when the IRQ occurs.
 * @flags: Interrupt type flags
 * @name: An ascii name for the claiming device
 * @dev: A cookie passed back to the handler function
 */
int request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags,
                const char *name, void *dev);
request_irq() instantiates an irqaction object with the values passed as
parameters and binds it to the irq_desc specified by the first (irq) parameter.
This call allocates interrupt resources and enables the interrupt line and IRQ
handling. handler is a function pointer of type irq_handler_t, which takes the
address of a driver-specific interrupt handler routine. flags is a bitmask of
options related to interrupt management. Flag bits are defined in the kernel
header <linux/interrupt.h>:
IRQF_SHARED: Used while binding an interrupt handler to a shared IRQ line.
IRQF_PROBE_SHARED: Set by callers when they expect sharing mismatches to
occur.
IRQF_TIMER: Flag to mark this interrupt as a timer interrupt.
IRQF_PERCPU: Interrupt is per CPU.
IRQF_NOBALANCING: Flag to exclude this interrupt from IRQ balancing.
IRQF_IRQPOLL: Interrupt is used for polling (only the interrupt that is
registered first in a shared interrupt is considered for performance reasons).
IRQF_NO_SUSPEND: Do not disable this IRQ during suspend. Does not guarantee
that this interrupt will wake the system from a suspended state.
IRQF_FORCE_RESUME: Force-enable it on resume even if IRQF_NO_SUSPEND is set.
IRQF_EARLY_RESUME: Resume IRQ early during syscore instead of at device
resume time.
IRQF_COND_SUSPEND: If the IRQ is shared with a NO_SUSPEND user, execute this
interrupt handler after suspending interrupts. For system wakeup devices,
users need to implement wakeup detection in their interrupt handlers.
Since each flag value is a single bit, a bitwise OR (that is, |) of a subset
of these can be passed, and if none apply, a value of 0 for the flags parameter
is valid. The address assigned to dev is treated as a unique cookie and serves
as an identifier for the action instance in a shared IRQ case. The value of
this parameter can be NULL while registering interrupt handlers without the
IRQF_SHARED flag.
On success, request_irq() returns zero; a nonzero return value indicates
failure to register the specified interrupt handler. The error code -EBUSY
denotes failure to register or bind the handler to a specified IRQ that is
already in use.
Interrupt handler routines have the following prototype:
irqreturn_t handler(int irq, void *dev_id);
irq specifies the IRQ number, and dev_id is the unique cookie used while
registering the handler. irqreturn_t is a typedef of an enumeration type:
enum irqreturn {
    IRQ_NONE = (0 << 0),
    IRQ_HANDLED = (1 << 0),
    IRQ_WAKE_THREAD = (1 << 1),
};
typedef enum irqreturn irqreturn_t;
The interrupt handler should return IRQ_NONE to indicate that the interrupt
was not handled. It is also used to indicate that the source of the interrupt
was not from its device in a shared IRQ case. When interrupt handling has
completed normally, it must return IRQ_HANDLED to indicate success.
IRQ_WAKE_THREAD is a special flag, returned to wake up the threaded handler;
we elaborate on it in the next section.
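The return-value convention for shared lines can be illustrated with a small
user-space model. This is only a sketch, not kernel code: struct demo_dev,
struct demo_action, demo_handler(), and demo_dispatch() are hypothetical names,
and the pending field stands in for a device's interrupt status register. Each
handler on the chain claims the interrupt only when its own device raised it.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the irqreturn values shown above. */
enum irqreturn { IRQ_NONE = 0, IRQ_HANDLED = 1 << 0, IRQ_WAKE_THREAD = 1 << 1 };
typedef enum irqreturn irqreturn_t;
typedef irqreturn_t (*irq_handler_t)(int, void *);

/* Hypothetical device: 'pending' models an interrupt status register. */
struct demo_dev {
    int pending;
    int serviced;
};

/* Simplified stand-in for struct irqaction: one node per registered handler. */
struct demo_action {
    irq_handler_t handler;
    void *dev_id;
    struct demo_action *next;
};

/* Handler for a shared line: claim the interrupt only if our device raised it. */
static irqreturn_t demo_handler(int irq, void *dev_id)
{
    struct demo_dev *dev = dev_id;

    (void)irq;
    if (!dev->pending)
        return IRQ_NONE;        /* some other device on the line */
    dev->pending = 0;           /* acknowledge the device */
    dev->serviced++;
    return IRQ_HANDLED;
}

/* Walk the action chain and OR together the results of every handler. */
static unsigned int demo_dispatch(int irq, struct demo_action *chain)
{
    unsigned int retval = IRQ_NONE;

    for (struct demo_action *a = chain; a; a = a->next) {
        assert(a->handler != NULL);
        retval |= a->handler(irq, a->dev_id);
    }
    return retval;
}
```

In this model the dev_id cookie is what lets one handler routine serve several
devices on the same line, which is why a real driver must pass a non-NULL
cookie when it registers with IRQF_SHARED.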
/**
 * free_irq - free an interrupt allocated with request_irq
 * @irq: Interrupt line to free
 * @dev_id: Device identity to free
 *
 * Remove an interrupt handler. The handler is removed and if the
 * interrupt line is no longer in use by any driver it is disabled.
 * On a shared IRQ the caller must ensure the interrupt is disabled
 * on the card it drives before calling this function. The function
 * does not return until any executing interrupts for this IRQ
 * have completed.
 * Returns the devname argument passed to request_irq.
 */
const void *free_irq(unsigned int irq, void *dev_id);
dev_id is the unique cookie (assigned while registering the handler) to
identify the handler to be deregistered in a shared IRQ case; this
argument can be NULL for other cases. This function is a potential
blocking call, and must not be invoked from an interrupt context: it
blocks the calling context until completion of any interrupt handler
currently in execution for the specified IRQ line.
Threaded interrupt handlers
Handlers registered through request_irq() are executed by the interrupt-handling
path of the kernel. This code path is asynchronous, and runs by suspending
scheduler preemption and hardware interrupts on the local processor, and so is
referred to as a hard IRQ context. Thus, it is imperative to program the
driver's interrupt handler routines to be short (do as little work as possible)
and atomic (nonblocking), to ensure responsiveness of the system. However, not
all hardware interrupt handlers can be short and atomic: there is a multitude of
complex devices generating interrupt events, whose responses involve
complex variable-time operations.
Conventionally, drivers are programmed to handle such complications with a
split-handler design for the interrupt handler, called top half and bottom half.
Top half routines are invoked in hard interrupt context, and these functions are
programmed to execute interrupt-critical operations, such as physical I/O on the
hardware registers, and schedule the bottom half for deferred execution. Bottom
half routines are usually programmed to deal with the rest of the interrupt's
non-critical and deferrable work, such as processing of data generated by the
top half, interacting with process context, and accessing user address space.
The kernel offers multiple mechanisms for scheduling and execution of bottom
half routines, each with a distinct interface API and policy of execution. We'll
elaborate on the design and usage details of formal bottom half mechanisms in
the next section.
As an alternative to using formal bottom-half mechanisms, the kernel supports
setting up interrupt handlers that can execute in a thread context, called
threaded interrupt handlers. Drivers can set up threaded interrupt handlers
through an alternate interface routine called request_threaded_irq():
/**
 * request_threaded_irq - allocate an interrupt line
 * @irq: Interrupt line to allocate
 * @handler: Function to be called when the IRQ occurs.
 *           Primary handler for threaded interrupts
 *           If NULL and thread_fn != NULL the default
 *           primary handler is installed
 * @thread_fn: Function called from the irq handler thread
 *             If NULL, no irq thread is created
 * @irqflags: Interrupt type flags
 * @devname: An ascii name for the claiming device
 * @dev_id: A cookie passed back to the handler function
 */
int request_threaded_irq(unsigned int irq, irq_handler_t handler,
                         irq_handler_t thread_fn, unsigned long irqflags,
                         const char *devname, void *dev_id);
The function assigned to handler serves as the primary interrupt handler that
executes in a hard IRQ context. The routine assigned to thread_fn is executed in
a thread context, and is scheduled to run when the primary handler returns
IRQ_WAKE_THREAD. With this split handler setup, there are two possible use
cases: the primary handler can be programmed to execute interrupt-critical work
and defer non-critical work to the thread handler for later execution, similar
to that of the bottom half. The alternative is a design that defers the entire
interrupt-handling code into the thread handler and restricts the primary
handler to verification of the interrupt source and waking up the thread
routine. This use case might require the corresponding interrupt line to be
masked until completion of the thread handler, to avoid the nesting of
interrupts. This can be accomplished either by programming the primary handler
to turn off the interrupt at source before waking up the thread handler or
through a flag bit IRQF_ONESHOT assigned while registering the threaded
interrupt handler.
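The second use case can be sketched as a user-space model. The names struct
demo_line, demo_hardirq(), and demo_irq_thread() are hypothetical, and the
masking logic is a deliberate simplification of what the flow handlers of the
generic IRQ layer do; the sketch only shows how a line registered with
IRQF_ONESHOT semantics stays masked from the moment the primary handler
returns IRQ_WAKE_THREAD until the threaded handler has finished.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum irqreturn { IRQ_NONE = 0, IRQ_HANDLED = 1 << 0, IRQ_WAKE_THREAD = 1 << 1 };
typedef enum irqreturn irqreturn_t;
typedef irqreturn_t (*irq_handler_t)(int, void *);

/* Hypothetical model of one interrupt line with a threaded handler. */
struct demo_line {
    irq_handler_t handler;      /* primary handler (hard IRQ context) */
    irq_handler_t thread_fn;    /* threaded handler (thread context)  */
    void *dev_id;
    bool oneshot;               /* models IRQF_ONESHOT                */
    bool masked;                /* line masked at the controller      */
    bool thread_woken;          /* threaded handler is pending        */
};

/* Models the hard IRQ path: run the primary handler, maybe wake the thread. */
static void demo_hardirq(struct demo_line *l, int irq)
{
    assert(l->handler != NULL);
    if (l->masked)
        return;                 /* a masked line delivers nothing */
    if (l->handler(irq, l->dev_id) == IRQ_WAKE_THREAD) {
        l->thread_woken = true;
        if (l->oneshot)
            l->masked = true;   /* stay masked until thread_fn finishes */
    }
}

/* Models the IRQ thread: run thread_fn, then unmask a oneshot line. */
static void demo_irq_thread(struct demo_line *l, int irq)
{
    if (!l->thread_woken)
        return;
    l->thread_woken = false;
    l->thread_fn(irq, l->dev_id);
    if (l->oneshot)
        l->masked = false;
}

/* Primary handler that defers all work to the thread. */
static irqreturn_t demo_primary(int irq, void *dev_id)
{
    (void)irq;
    (void)dev_id;
    return IRQ_WAKE_THREAD;
}

/* Threaded handler: counts how often the deferred work ran. */
static irqreturn_t demo_thread_fn(int irq, void *dev_id)
{
    (void)irq;
    ++*(int *)dev_id;
    return IRQ_HANDLED;
}
```

A second interrupt arriving while the line is masked is simply not delivered in
this model, which is the nesting that the oneshot behavior is meant to prevent.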
The following are irqflags related to threaded interrupt handlers:
IRQF_ONESHOT: The interrupt is not re-enabled after the hard IRQ handler is
finished. This is used by threaded interrupts that need to keep the IRQ line
disabled until the threaded handler has been run.
IRQF_NO_THREAD: The interrupt cannot be threaded. This is used in shared IRQs
to restrict the use of threaded interrupt handlers.
A call to request_threaded_irq() with NULL assigned to handler will cause the
kernel to use the default primary handler, which simply returns IRQ_WAKE_THREAD.
A call to this function with NULL assigned to thread_fn is synonymous with
request_irq():
static inline int __must_check
request_irq(unsigned int irq, irq_handler_t handler, unsigned long flags,
            const char *name, void *dev)
{
    return request_threaded_irq(irq, handler, NULL, flags, name, dev);
}
Another alternate interface for setting up an interrupt handler is
request_any_context_irq(). This routine has a signature similar to that of
request_irq() but varies slightly in its functionality:
/**
 * request_any_context_irq - allocate an interrupt line
 * @irq: Interrupt line to allocate
 * @handler: Function to be called when the IRQ occurs.
 *           Threaded handler for threaded interrupts.
 * @flags: Interrupt type flags
 * @name: An ascii name for the claiming device
 * @dev_id: A cookie passed back to the handler function
 *
 * This call allocates interrupt resources and enables the
 * interrupt line and IRQ handling. It selects either a
 * hardirq or threaded handling method depending on the
 * context.
 * On failure, it returns a negative value. On success,
 * it returns either IRQC_IS_HARDIRQ or IRQC_IS_NESTED.
 */
int request_any_context_irq(unsigned int irq, irq_handler_t handler,
                            unsigned long flags, const char *name, void *dev_id);
This function differs from request_irq() in that it looks into the IRQ
descriptor for properties of the interrupt line as set up by the
architecture-specific code, and decides whether to establish the function
assigned as a traditional hard IRQ handler or as a threaded interrupt handler.
On success, IRQC_IS_HARDIRQ is returned if the handler was established to run in
hard IRQ context, or IRQC_IS_NESTED otherwise.
Control interfaces
The generic IRQ layer provides routines to carry out control operations on IRQ
lines. Following is the list of functions for masking and unmasking specific IRQ
lines:
void disable_irq(unsigned int irq);
This disables the specified IRQ line by manipulating the counter in the IRQ
descriptor structure. This routine is a possible blocking call, as it waits
until any running handlers for this interrupt complete. Alternatively, the
function disable_irq_nosync() can also be used to disable the given IRQ line;
this call does not check and wait for any running handlers for the given
interrupt line to complete:
void disable_irq_nosync(unsigned int irq);
Disabled IRQ lines can be enabled with a call to:
void enable_irq(unsigned int irq);
Note that IRQ enable and disable operations nest, that is, multiple calls to
disable an IRQ line require the same number of enable calls for that IRQ line to
be reenabled. This means that enable_irq() will enable the given IRQ only when a
call to it matches the last disable operation.
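The nesting rule can be modeled with a simple depth counter. The following is a
user-space sketch under stated assumptions: struct demo_irq_desc,
demo_disable_irq(), and demo_enable_irq() are hypothetical stand-ins, and the
masked field merely represents the state of the line at the interrupt
controller.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the depth counter kept in the IRQ descriptor. */
struct demo_irq_desc {
    unsigned int depth;     /* 0 means the line is enabled */
    bool masked;            /* state of the line at the controller */
};

/* Models a non-waiting disable: only the first disable masks the line. */
static void demo_disable_irq(struct demo_irq_desc *desc)
{
    if (desc->depth++ == 0)
        desc->masked = true;
}

/* Models enable: only the call that balances the first disable unmasks. */
static void demo_enable_irq(struct demo_irq_desc *desc)
{
    assert(desc->depth > 0);    /* an unbalanced enable is a caller bug */
    if (--desc->depth == 0)
        desc->masked = false;
}
```

After two disable calls, the first enable call leaves the line masked with a
depth of 1; only the second one unmasks it.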
Interrupts can also be disabled and enabled for the local CPU alone; the
following pairs of macros can be used for this purpose:
local_irq_disable(): Disables interrupts on the local processor.
local_irq_enable(): Enables interrupts on the local processor.
local_irq_save(unsigned long flags): Saves the current interrupt state in
flags, and then disables interrupts on the local CPU.
local_irq_restore(unsigned long flags): Restores interrupts on the local CPU
to the previous state saved in flags.
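The difference between the two pairs matters when critical sections nest. The
following user-space sketch models it with a single boolean; demo_irqs_on,
demo_irq_save(), demo_irq_restore(), and demo_helper() are hypothetical names,
not kernel interfaces. A helper that uses save and restore leaves interrupts in
whatever state its caller had established, whereas an unconditional enable
would turn them back on in the middle of the caller's critical section.

```c
#include <assert.h>
#include <stdbool.h>

/* Models the interrupt-enable flag of the local CPU. */
static bool demo_irqs_on = true;

/* Save the current state, then disable (models local_irq_save). */
static unsigned long demo_irq_save(void)
{
    unsigned long flags = demo_irqs_on;

    demo_irqs_on = false;
    return flags;
}

/* Put back exactly the state that was saved (models local_irq_restore). */
static void demo_irq_restore(unsigned long flags)
{
    assert(!demo_irqs_on);      /* restore is expected with interrupts off */
    demo_irqs_on = (flags != 0);
}

/* A helper whose critical section is safe to call from any context. */
static void demo_helper(int *counter)
{
    unsigned long flags = demo_irq_save();

    ++*counter;
    demo_irq_restore(flags);
}
```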
/*
 * per-CPU IRQ handling stacks
 */
struct irq_stack {
    u32 stack[THREAD_SIZE / sizeof(u32)];
} __aligned(THREAD_SIZE);

DECLARE_PER_CPU(struct irq_stack *, hardirq_stack);
DECLARE_PER_CPU(struct irq_stack *, softirq_stack);
Apart from these, x86 64-bit builds also include special stacks; more
details can be found in the kernel source documentation <x86/kernel-stacks>:
Double fault stack
Debug stack
NMI stack
MCE stack
Deferred work
As introduced in an earlier section, bottom halves are kernel mechanisms for
executing deferred work, and can be engaged by any kernel code to defer
execution of non-critical work until some time in the future. To support
implementation and management of deferred routines, the kernel implements
special frameworks, called softirqs, tasklets, and workqueues. Each of these
frameworks comprises a set of data structures and function interfaces, used for
registering, scheduling, and queuing of the bottom half routines. Each
mechanism is designed with a distinct policy for management and execution of
bottom halves. Drivers and other kernel services that require deferred execution
will need to bind and schedule their BH routines through the appropriate
framework.
Softirqs
The term softirq loosely translates to soft interrupt, and as the name suggests,
deferred routines managed by this framework are executed at a high priority but
with hard interrupt lines enabled. Thus, softirq bottom halves (or softirqs) can
preempt all other tasks except hard interrupt handlers. However, usage of
softirqs is restricted to static kernel code and this mechanism is not available
for dynamic kernel modules.
Each softirq is represented through an instance of type struct softirq_action
declared in the kernel header <linux/interrupt.h>. This structure contains a
function pointer that can hold the address of the bottom half routine:
struct softirq_action
{
    void (*action)(struct softirq_action *);
};
Current versions of the kernel have 10 softirqs, each indexed through an enum in
the kernel header <linux/interrupt.h>. These indexes serve as an identity and
are treated as the relative priority of the softirq, and entries with lower
indexes are considered higher in priority, with index 0 being the
highest-priority softirq:
enum
{
    HI_SOFTIRQ = 0,
    TIMER_SOFTIRQ,
    NET_TX_SOFTIRQ,
    NET_RX_SOFTIRQ,
    BLOCK_SOFTIRQ,
    IRQ_POLL_SOFTIRQ,
    TASKLET_SOFTIRQ,
    SCHED_SOFTIRQ,
    HRTIMER_SOFTIRQ, /* Unused, but kept as tools rely on the
                        numbering. Sigh! */
    RCU_SOFTIRQ,     /* Preferable RCU should always be the last softirq */
    NR_SOFTIRQS
};
The kernel source file <kernel/softirq.c> declares an array called softirq_vec
of size NR_SOFTIRQS, with each offset containing a softirq_action instance of
the corresponding softirq indexed in the enum:
static struct softirq_action softirq_vec[NR_SOFTIRQS] __cacheline_aligned_in_smp;

/* string constants for naming each softirq */
const char * const softirq_to_name[NR_SOFTIRQS] = {
    "HI", "TIMER", "NET_TX", "NET_RX", "BLOCK", "IRQ_POLL",
    "TASKLET", "SCHED", "HRTIMER", "RCU"
};
The framework provides a function, open_softirq(), used for initializing the
softirq instance with the corresponding bottom-half routine:
void open_softirq(int nr, void (*action)(struct softirq_action *))
{
    softirq_vec[nr].action = action;
}
nr is the index of the softirq to be initialized and *action is a function
pointer to be initialized with the address of the bottom-half routine. The
following code excerpt is taken from the timer service, and shows the invocation
of open_softirq() to register a softirq:
/* kernel/time/timer.c */
open_softirq(TIMER_SOFTIRQ, run_timer_softirq);
Kernel services can signal the execution of softirq handlers using the function
raise_softirq(). This function takes the index of the softirq as an argument:
void raise_softirq(unsigned int nr)
{
    unsigned long flags;

    local_irq_save(flags);
    raise_softirq_irqoff(nr);
    local_irq_restore(flags);
}
The following code excerpt is from <kernel/time/timer.c>:
void run_local_timers(void)
{
    struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);

    hrtimer_run_queues();
    /* Raise the softirq only if required. */
    if (time_before(jiffies, base->clk)) {
        if (!IS_ENABLED(CONFIG_NO_HZ_COMMON) || !base->nohz_active)
            return;
        /* CPU is awake, so check the deferrable base. */
        base++;
        if (time_before(jiffies, base->clk))
            return;
    }
    raise_softirq(TIMER_SOFTIRQ);
}
The kernel maintains a per-CPU bitmask for keeping track of softirqs raised for
execution, and the function raise_softirq() sets the corresponding bit (the
index passed as the argument) in the local CPU's softirq bitmask to mark the
specified softirq as pending.
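The interplay between the action array, the pending bitmask, and index-based
priority can be sketched as a single-CPU user-space model. All demo_ names are
hypothetical and the enum is truncated to four entries for brevity; the sketch
only shows that raising a softirq sets a bit, and that pending softirqs are then
run from the lowest index upward.

```c
#include <assert.h>
#include <stddef.h>

enum { HI_SOFTIRQ = 0, TIMER_SOFTIRQ, NET_TX_SOFTIRQ, NET_RX_SOFTIRQ,
       DEMO_NR_SOFTIRQS };

struct demo_softirq_action {
    void (*action)(struct demo_softirq_action *);
};

static struct demo_softirq_action demo_vec[DEMO_NR_SOFTIRQS];
static unsigned int demo_pending;           /* one bit per softirq, one CPU */
static int demo_order[DEMO_NR_SOFTIRQS];    /* records the execution order  */
static int demo_ran;

/* Bind a bottom-half routine to a softirq index. */
static void demo_open_softirq(int nr,
                              void (*action)(struct demo_softirq_action *))
{
    assert(nr >= 0 && nr < DEMO_NR_SOFTIRQS);
    demo_vec[nr].action = action;
}

/* Mark a softirq as pending; raising it twice still sets a single bit. */
static void demo_raise_softirq(unsigned int nr)
{
    assert(nr < DEMO_NR_SOFTIRQS);
    demo_pending |= 1u << nr;
}

/* Run every pending softirq once, lowest index (highest priority) first. */
static void demo_do_softirq(void)
{
    unsigned int pending = demo_pending;

    demo_pending = 0;
    for (int nr = 0; pending; nr++, pending >>= 1)
        if ((pending & 1) && demo_vec[nr].action)
            demo_vec[nr].action(&demo_vec[nr]);
}

/* Example action: note which softirq index ran. */
static void demo_record(struct demo_softirq_action *a)
{
    demo_order[demo_ran++] = (int)(a - demo_vec);
}
```

Because the pending state is one bit per softirq, raising the same softirq
several times before it is processed results in a single run of its handler.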
Pending softirq handlers are checked and executed at various points in the
kernel code. Principally, they are executed in the interrupt context,
immediately after the completion of hard interrupt handlers with IRQ lines
enabled. This guarantees swift processing of softirqs raised from hard interrupt
handlers, resulting in optimal cache usage. However, the kernel allows an
arbitrary task to suspend execution of softirq processing on a local processor
either through local_bh_disable() or spin_lock_bh() calls. Pending softirq
handlers are executed in the context of an arbitrary task that re-enables
softirq processing by invoking either local_bh_enable() or spin_unlock_bh()
calls. And lastly, softirq handlers can also be executed by a per-CPU kernel
thread ksoftirqd, which is woken up when a softirq is raised by any
process-context kernel routine. This thread is also woken up from the interrupt
context when too many softirqs accumulate due to high load.
Softirqs are most suitable for completion of priority work deferred from hard
interrupt handlers since they run immediately on completion of hard interrupt
handlers. However, softirq handlers are reentrant, and must be programmed to
engage appropriate protection mechanisms while accessing data structures, if
any. The reentrant nature of softirqs may cause unbounded latencies, impacting
the efficiency of the system as a whole, which is why their usage is restricted,
and new ones are almost never added, unless they are an absolute necessity for
the execution of high-frequency threaded deferred work. For all other types of
deferred work, tasklets and workqueues are suggested.
Tasklets
The tasklet mechanism is a sort of wrapper around the softirq framework; in
fact, tasklet handlers are executed by softirqs. Unlike softirqs, tasklets are
not reentrant, which guarantees that the same tasklet handler can never run
concurrently. This helps minimize overall latencies, provided programmers
examine and impose relevant checks to ensure that work done in a tasklet is
non-blocking and atomic. Another difference is with respect to their usage:
unlike softirqs (which are restricted), any kernel code can use tasklets, and
this includes dynamically linked services.
Each tasklet is represented through an instance of type struct tasklet_struct
declared in the kernel header <linux/interrupt.h>:
struct tasklet_struct
{
    struct tasklet_struct *next;
    unsigned long state;
    atomic_t count;
    void (*func)(unsigned long);
    unsigned long data;
};
Upon initialization, *func holds the address of the handler routine and data is
used to pass a data blob as a parameter to the handler routine during
invocation. Each tasklet carries a state, which can be either
TASKLET_STATE_SCHED, which indicates that it is scheduled for execution, or
TASKLET_STATE_RUN, which indicates it is in execution. An atomic counter is used
to enable or disable a tasklet; when count equals a non-zero value, it indicates
that the tasklet is disabled, and zero indicates that it is enabled. A disabled
tasklet cannot be executed even if scheduled, until it is enabled at some future
time.
Kernel services can instantiate a new tasklet statically through any of the
following macros:
#define DECLARE_TASKLET(name, func, data) \
struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(0), func, data }

#define DECLARE_TASKLET_DISABLED(name, func, data) \
struct tasklet_struct name = { NULL, 0, ATOMIC_INIT(1), func, data }
New tasklets can be instantiated dynamically at runtime through the following:
void tasklet_init(struct tasklet_struct *t,
                  void (*func)(unsigned long), unsigned long data)
{
    t->next = NULL;
    t->state = 0;
    atomic_set(&t->count, 0);
    t->func = func;
    t->data = data;
}
The kernel maintains two per-CPU tasklet lists for queuing scheduled tasklets,
and the definitions of these lists can be found in the source file
<kernel/softirq.c>:
/*
 * Tasklets
 */
struct tasklet_head {
    struct tasklet_struct *head;
    struct tasklet_struct **tail;
};

static DEFINE_PER_CPU(struct tasklet_head, tasklet_vec);
static DEFINE_PER_CPU(struct tasklet_head, tasklet_hi_vec);
tasklet_vec is considered the normal list, and all queued tasklets present in
this list are run by TASKLET_SOFTIRQ (one of the 10 softirqs). tasklet_hi_vec is
a high-priority tasklet list, and all queued tasklets present in this list are
executed by HI_SOFTIRQ, which happens to be the highest-priority softirq. A
tasklet can be queued for execution into the appropriate list by invoking
tasklet_schedule() or tasklet_hi_schedule().
The following code shows the implementation of tasklet_schedule(); this function
is invoked with the address of the tasklet instance to be queued as a parameter:
extern void __tasklet_schedule(struct tasklet_struct *t);

static inline void tasklet_schedule(struct tasklet_struct *t)
{
    if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
        __tasklet_schedule(t);
}
The conditional construct checks if the specified tasklet is already scheduled;
if not, it atomically sets the state to TASKLET_STATE_SCHED and invokes
__tasklet_schedule() to enqueue the tasklet instance into the pending list. If
the specified tasklet is already found to be in the TASKLET_STATE_SCHED state,
it is not rescheduled:
void __tasklet_schedule(struct tasklet_struct *t)
{
    unsigned long flags;

    local_irq_save(flags);
    t->next = NULL;
    *__this_cpu_read(tasklet_vec.tail) = t;
    __this_cpu_write(tasklet_vec.tail, &(t->next));
    raise_softirq_irqoff(TASKLET_SOFTIRQ);
    local_irq_restore(flags);
}
This function silently enqueues the specified tasklet to the tail of the
tasklet_vec and raises the TASKLET_SOFTIRQ on the local processor.
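The scheduling rules described so far (a tasklet is queued at most once, is
appended at the tail, and is skipped while its disable count is non-zero) can be
modeled in user space as follows. This is a single-threaded sketch with
hypothetical demo_ names; it uses plain integer operations where the kernel uses
atomic bit operations, and demo_tasklet_action() is only an approximation of the
softirq action that drains the list.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_STATE_SCHED (1ul << 0)

struct demo_tasklet {
    struct demo_tasklet *next;
    unsigned long state;
    int count;                       /* non-zero means disabled */
    void (*func)(unsigned long);
    unsigned long data;
};

struct demo_tasklet_head {
    struct demo_tasklet *head;
    struct demo_tasklet **tail;
};

/* An empty list: tail points at head, as in the per-CPU kernel lists. */
static struct demo_tasklet_head demo_vec = { NULL, &demo_vec.head };

/* Returns 1 if newly queued, 0 if the tasklet was already scheduled. */
static int demo_tasklet_schedule(struct demo_tasklet *t)
{
    assert(t != NULL);
    if (t->state & DEMO_STATE_SCHED)
        return 0;
    t->state |= DEMO_STATE_SCHED;
    t->next = NULL;
    *demo_vec.tail = t;              /* link at the tail */
    demo_vec.tail = &t->next;
    return 1;
}

/* Detach the whole list, run enabled tasklets, requeue disabled ones. */
static void demo_tasklet_action(void)
{
    struct demo_tasklet *list = demo_vec.head;

    demo_vec.head = NULL;
    demo_vec.tail = &demo_vec.head;
    while (list) {
        struct demo_tasklet *t = list;

        list = list->next;
        if (t->count == 0) {
            t->state &= ~DEMO_STATE_SCHED;
            t->func(t->data);
        } else {
            t->next = NULL;          /* disabled: keep it for a later pass */
            *demo_vec.tail = t;
            demo_vec.tail = &t->next;
        }
    }
}

/* Example handler: data carries the address of a counter. */
static void demo_bump(unsigned long data)
{
    ++*(int *)data;
}
```

Passing a pointer through the unsigned long data member, as demo_bump() does,
assumes that a pointer fits in an unsigned long, which holds on Linux targets.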
Following is the code for the tasklet_hi_schedule() routine:
extern void __tasklet_hi_schedule(struct tasklet_struct *t);

static inline void tasklet_hi_schedule(struct tasklet_struct *t)
{
    if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
        __tasklet_hi_schedule(t);
}
Actions executed in this routine are similar to those of tasklet_schedule(),
with the exception that it invokes __tasklet_hi_schedule() to enqueue the
specified tasklet at the tail of tasklet_hi_vec:
void __tasklet_hi_schedule(struct tasklet_struct *t)
{
    unsigned long flags;

    local_irq_save(flags);
    t->next = NULL;
    *__this_cpu_read(tasklet_hi_vec.tail) = t;
    __this_cpu_write(tasklet_hi_vec.tail, &(t->next));
    raise_softirq_irqoff(HI_SOFTIRQ);
    local_irq_restore(flags);
}
This call raises HI_SOFTIRQ on the local processor, which turns all tasklets
queued in tasklet_hi_vec into the highest-priority bottom halves (higher in
priority than the rest of the softirqs).
Another variant is tasklet_hi_schedule_first(), which inserts the specified
tasklet at the head of tasklet_hi_vec and raises HI_SOFTIRQ:
extern void __tasklet_hi_schedule_first(struct tasklet_struct *t);

static inline void tasklet_hi_schedule_first(struct tasklet_struct *t)
{
    if (!test_and_set_bit(TASKLET_STATE_SCHED, &t->state))
        __tasklet_hi_schedule_first(t);
}

/* kernel/softirq.c */
void __tasklet_hi_schedule_first(struct tasklet_struct *t)
{
    BUG_ON(!irqs_disabled());
    t->next = __this_cpu_read(tasklet_hi_vec.head);
    __this_cpu_write(tasklet_hi_vec.head, t);
    __raise_softirq_irqoff(HI_SOFTIRQ);
}
Other interface routines exist that are used to enable, disable, and kill
scheduled tasklets:
void tasklet_disable(struct tasklet_struct *t);
This function disables the specified tasklet by incrementing its disable count.
The tasklet may still be scheduled, but it is not executed until it has been
enabled again. If the tasklet is currently running when this call is invoked,
this function busy-waits until the tasklet completes.
void tasklet_enable(struct tasklet_struct *t);
This attempts to enable a tasklet that had been previously disabled by
decrementing its disable count. If the tasklet has already been scheduled, it
will run soon.
void tasklet_kill(struct tasklet_struct *t);
This function is called to kill the given tasklet, to ensure that it cannot be
scheduled to run again. If the tasklet specified is already scheduled by the
time this call is invoked, then this function waits until its execution
completes.
void tasklet_kill_immediate(struct tasklet_struct *t, unsigned int cpu);
This function is called to kill an already scheduled tasklet. It immediately
removes the specified tasklet from the list even if the tasklet is in the
TASKLET_STATE_SCHED state.
Workqueues
Workqueues (wqs) are mechanisms for the execution of asynchronous process
context routines. As the name aptly suggests, a workqueue (wq) is a list of work
items, each containing a function pointer that takes the address of a routine to
be executed asynchronously. Whenever some kernel code (that belongs to a
subsystem or a service) intends to defer some work for asynchronous process
context execution, it must initialize the work item with the address of the
handler function, and enqueue it onto a workqueue. The kernel uses a dedicated
pool of kernel threads, called kworker threads, to execute functions bound to
each work item in the queue, sequentially.
Interface API
The workqueue API offers two types of function interfaces: first, a set of
interface routines to instantiate and queue work items onto a global workqueue,
which is shared by all kernel subsystems and services, and second, a set of
interface routines to set up a new workqueue, and queue work items onto it. We
will begin to explore workqueue interfaces with macros and functions related to
the global shared workqueue.
Each work item in the queue is represented by an instance of type struct
work_struct, which is declared in the kernel header <linux/workqueue.h>:
struct work_struct {
    atomic_long_t data;
    struct list_head entry;
    work_func_t func;
#ifdef CONFIG_LOCKDEP
    struct lockdep_map lockdep_map;
#endif
};
func is a pointer that takes the address of the deferred routine; a new struct
work_struct object can be created and initialized through the macro
DECLARE_WORK:
#define DECLARE_WORK(n, f) \
struct work_struct n = __WORK_INITIALIZER(n, f)
n is the name of the instance to be created and f is the address of the function
to be assigned. A work instance can be scheduled into the workqueue through
schedule_work():
bool schedule_work(struct work_struct *work);
This function enqueues the given work item on the local CPU workqueue, but
does not guarantee its execution on it. It returns true if the given work is
successfully enqueued, or false if the given work is already found in the
workqueue. Once queued, the function associated with the work item is executed
on any of the available CPUs by the relevant kworker thread. Alternatively, a
work item can be marked for execution on a specific CPU, while scheduling it
into the queue (which might yield better cache utilization); this can be done
with a call to schedule_work_on():
bool schedule_work_on(int cpu, struct work_struct *work);
cpu is the identifier of the CPU to which the work task is to be bound. For
instance, to schedule a work task onto the local CPU, the caller can invoke:
schedule_work_on(smp_processor_id(), &t_work);
smp_processor_id() is a kernel macro (defined in <linux/smp.h>) that returns the
local CPU identifier.
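The true/false return convention follows from a pending marker carried by each
work item. The user-space sketch below models it with a boolean and a single
queue; all demo_ names are hypothetical, there is no locking, and the worker is
an ordinary function call instead of a kworker thread. It assumes, as a
modeling choice, that the pending marker is cleared just before the handler
runs, so that a handler may requeue its own work item.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_work;
typedef void (*demo_work_func_t)(struct demo_work *);

struct demo_work {
    bool pending;               /* set while the item sits in the queue */
    demo_work_func_t func;
    struct demo_work *next;
};

static struct demo_work *demo_head;
static struct demo_work **demo_tail = &demo_head;
static int demo_calls;

/* Returns true if queued, false if the item was already pending. */
static bool demo_schedule_work(struct demo_work *w)
{
    assert(w != NULL && w->func != NULL);
    if (w->pending)
        return false;
    w->pending = true;
    w->next = NULL;
    *demo_tail = w;
    demo_tail = &w->next;
    return true;
}

/* Models one pass of a worker; returns the number of items executed. */
static int demo_worker_run(void)
{
    int n = 0;

    while (demo_head) {
        struct demo_work *w = demo_head;

        demo_head = w->next;
        if (!demo_head)
            demo_tail = &demo_head;
        w->pending = false;     /* cleared first, so func may requeue w */
        w->func(w);
        n++;
    }
    return n;
}

/* Example deferred routine. */
static void demo_fn(struct demo_work *w)
{
    (void)w;
    demo_calls++;
}
```

Queuing the same item twice before the worker runs therefore results in one
execution, and the second call reports false.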
The interface API also offers variants of the scheduling calls, which allow the
caller to queue work tasks whose execution is guaranteed to be delayed at least
until a specified timeout. This is achieved by binding a work task with a timer,
which can be initialized with an expiry timeout, until which time the work task
is not scheduled into the queue:
struct delayed_work {
    struct work_struct work;
    struct timer_list timer;

    /* target workqueue and CPU ->timer uses to queue ->work */
    struct workqueue_struct *wq;
    int cpu;
};
timer is an instance of a dynamic timer descriptor, which is initialized with
the expiry interval and armed while scheduling a work task. We'll discuss kernel
timers and other time-related concepts more in the next chapter.
Callers can instantiate delayed_work and initialize it statically through a
macro:
#define DECLARE_DELAYED_WORK(n, f) \
struct delayed_work n = __DELAYED_WORK_INITIALIZER(n, f, 0)
Similar to normal work tasks, delayed work tasks can be scheduled to run on any
of the available CPUs or be scheduled to execute on a specified core. To
schedule delayed work that can run on any of the available processors, callers
can invoke schedule_delayed_work(), and to schedule delayed work onto specific
CPUs, use the function schedule_delayed_work_on():
bool schedule_delayed_work(struct delayed_work *dwork, unsigned long delay);
bool schedule_delayed_work_on(int cpu, struct delayed_work *dwork,
                              unsigned long delay);
Note that if the delay is zero, then the specified work item is scheduled for
immediate execution.
Creating dedicated workqueues
Timing of the execution of work items scheduled onto the global workqueue is
not predictable: one long-running work item can always cause indefinite delays
for the rest. Alternatively, the workqueue framework allows the allocation of
dedicated workqueues, which can be owned by a kernel subsystem or a service.
Interface APIs used to create and schedule work into these queues provide
control flags, through which owners can set special attributes such as CPU
locality, concurrency limits, and priority, which have an influence on the
execution of work items queued.
A new workqueue can be set up through a call to alloc_workqueue(); the following
excerpt taken from <fs/nfs/inode.c> shows sample usage:
struct workqueue_struct *wq;
...
wq = alloc_workqueue("nfsiod", WQ_MEM_RECLAIM, 0);
This call takes three arguments: the first is a string constant to name the
workqueue. The second argument is the bitfield of flags, and the third an
integer called max_active. The last two are used to specify control attributes
of the queue. On success, this function returns the address of the workqueue
descriptor.
The following is a list of flag options:
WQ_UNBOUND: Workqueues created with this flag are managed by kworker-pools
that are not bound to any specific CPU. This causes all work items
scheduled to this queue to run on any available processor. Work items in
this queue are executed as soon as possible by kworker pools.
WQ_FREEZABLE: A workqueue of this type is freezable, which means that it is
affected by system suspend operations. During suspend, all current work
items are drained and no new work item can run until the system is
thawed or resumed.
WQ_MEM_RECLAIM: This flag is used to mark a workqueue that contains work
items involved in memory reclaim paths. This causes the framework to
ensure that there is always a worker thread available to run work items on
this queue.
WQ_HIGHPRI: This flag is used to mark a workqueue as high priority. Work
items in high-priority workqueues have a higher precedence over normal
ones, in that these are executed by a high-priority pool of kworker threads.
The kernel maintains a dedicated pool of high-priority kworker threads for
each CPU, which are distinct from normal kworker pools.
WQ_CPU_INTENSIVE: This flag marks work items on this workqueue to be CPU
intensive. This helps the system scheduler to regulate the execution of work
items that are expected to hog the CPU for long intervals. This means
runnable CPU-intensive work items will not prevent other work items in the
same kworker-pool from starting. A runnable non-CPU-intensive work item
can always delay the execution of work items marked as CPU intensive.
This flag is meaningless for an unbound wq.
WQ_POWER_EFFICIENT: Workqueues marked with this flag are per-CPU by
default, but become unbound if the system was booted with the
workqueue.power_efficient kernel parameter set. Per-CPU workqueues that are
found to contribute significantly to power consumption are marked with this
flag, and enabling the power_efficient mode leads to noticeable power savings
at the cost of a slight performance penalty.
The final argument max_active is an integer, which must specify the count of
work items that can be executed simultaneously from this workqueue on any given
CPU.
Once a dedicated workqueue is set up, work items can be scheduled through any
of the following calls:
bool queue_work(struct workqueue_struct *wq, struct work_struct *work);
wq is a pointer to the queue; the call enqueues the specified work item on the
local CPU, but does not guarantee execution on the local processor. This call
returns true if the given work item is successfully queued, and false if the
given work item is already scheduled.
Alternatively, callers can enqueue a work item bound to a specific CPU with a
call to:
bool queue_work_on(int cpu, struct workqueue_struct *wq,
                   struct work_struct *work);
This call enqueues the work item into the workqueue of the specified cpu. It
returns true if the given work item is successfully queued and false if the
given work item is already found in the queue.
Similar to the shared workqueue APIs, delayed scheduling options are also
available for dedicated workqueues. The following calls are to be used for
delayed scheduling of work items:
bool queue_delayed_work_on(int cpu, struct workqueue_struct *wq,
                           struct delayed_work *dwork, unsigned long delay);
bool queue_delayed_work(struct workqueue_struct *wq, struct delayed_work *dwork,
                        unsigned long delay);
Both calls delay scheduling of the given work until the timeout specified by the
delay has elapsed, with the exception that queue_delayed_work_on() enqueues the
given work item on the specified CPU and guarantees its execution on it. Note
that if the delay specified is zero and the workqueue is idle, then the given
work item is scheduled for immediate execution.
Summary
Through this chapter, we have touched base with interrupts, the various
components that make up the whole infrastructure, and how the kernel manages
it efficiently. We understood how the kernel engages abstraction to smoothly
handle varied interrupt signals routed from various controllers. The kernel's
effort in simplifying complex programming approaches is again brought to the
fore through the high-level interrupt-management interfaces. We also stretched
our understanding of the key routines and important data structures of the
interrupt subsystem, and explored kernel mechanisms for handling deferred
work.
In the next chapter, we will explore the kernel's timekeeping subsystem to
understand key concepts such as time measurement, interval timers, and timeout
and delay routines.
Clock and Time Management
The Linux time management subsystem manages various time-related activities
and keeps track of timing data such as current time and date, time elapsed since
system bootup (system uptime), and timeouts, for example, how long to wait for
a particular event to be initiated or terminated, locking the system after a timeout
period has elapsed, or raising a signal to kill an unresponsive process.
There are two types of timing activities handled by the Linux time management
subsystem:
Keeping the current time and date
Maintaining timers
Time representation
Depending on the use cases, time is represented in three different ways in Linux:
1. Wall time (or real time): This is the actual time and date in the real world,
such as 07:00 AM, 10 Aug 2017, and is used for timestamps on files and
packets sent through the network.
2. Process time: This is the time consumed by a process in its life span. It
includes the time consumed by the process in user mode and the time
consumed by the kernel code when executing on behalf of the process. This
is useful for statistical purposes, auditing, and profiling.
3. Monotonic time: This is the time elapsed since system bootup (system
uptime). It is ever incrementing and monotonic in nature.
These three times are measured in either of the following ways:
1. Relative time: This is the time relative to some specific event, such as 7
minutes since system bootup, or 2 minutes since the last input from the user.
2. Absolute time: This is a unique point in time without any reference to a
previous event, such as 10:00 AM, 12 Aug 2017. In Linux, absolute time is
represented as the number of elapsed seconds since 00:00:00 midnight of 1
January 1970 (UTC).
Wall time is ever incrementing (unless it has been modified by the user), even
between reboots and shutdowns, but process time and system uptime start from
some predefined point in time (usually zero) every time a new process is created
or when the system starts.
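From user space, the three representations can be observed through the POSIX clock_gettime() call. The following is a minimal sketch, assuming a Linux system where CLOCK_REALTIME, CLOCK_MONOTONIC, and CLOCK_PROCESS_CPUTIME_ID are available; the helper names are illustrative and not part of any kernel interface. Note that CLOCK_MONOTONIC only approximates uptime, since its starting point is unspecified:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>

/* Read one POSIX clock and return its value in nanoseconds (0 on failure). */
static uint64_t read_clock_ns(clockid_t id)
{
    struct timespec ts;

    if (clock_gettime(id, &ts) != 0)
        return 0;
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Wall time: nanoseconds since 00:00:00 UTC, 1 January 1970. */
static uint64_t wall_time_ns(void)
{
    return read_clock_ns(CLOCK_REALTIME);
}

/* Monotonic time: never steps backwards; starts at an unspecified point. */
static uint64_t monotonic_ns(void)
{
    return read_clock_ns(CLOCK_MONOTONIC);
}

/* Process time: CPU time consumed by the calling process. */
static uint64_t process_time_ns(void)
{
    return read_clock_ns(CLOCK_PROCESS_CPUTIME_ID);
}
```

Subtracting two monotonic_ns() readings gives a relative time, whereas wall_time_ns() on its own is an absolute time.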
Timing hardware
Linux relies on appropriate hardware devices to maintain time. These hardware
devices can be categorized broadly into two types: system clock and timers.
Real-time clock (RTC)
Keeping track of the current time and date is crucial, not just to let the user
know about it but to use it as a timestamp for various resources in the system,
specifically, files present in secondary storage. Every file has metadata
information such as the date of creation and last modification date, and every
time a file is created or modified, these two fields are updated with the current
time in the system. These fields are used by several apps to manage files, such as
to sort, group, or even delete them (if the file hasn't been accessed for a long
time). The make tool uses this timestamp to determine whether a source file has
been edited since the last time it accessed it; only then is it compiled, otherwise
it is left untouched.
The system clock RTC keeps track of the current time and date; backed by an
additional battery, it continues to tick even when the system is turned off.
The RTC can raise interrupts on IRQ8 periodically. It can also be used as an
alarm facility, by programming the RTC to raise an interrupt on IRQ8 when it
reaches a specific time. In IBM-compatible PCs, the RTC is mapped to the 0x70
and 0x71 I/O ports. It can be accessed through the /dev/rtc device file.
struct x86_platform_ops {
    unsigned long (*calibrate_cpu)(void);
    unsigned long (*calibrate_tsc)(void);
    void (*get_wallclock)(struct timespec *ts);
    int (*set_wallclock)(const struct timespec *ts);
    void (*iommu_shutdown)(void);
    bool (*is_untracked_pat_range)(u64 start, u64 end);
    void (*nmi_init)(void);
    unsigned char (*get_nmi_reason)(void);
    void (*save_sched_clock_state)(void);
    void (*restore_sched_clock_state)(void);
    void (*apic_post_init)(void);
    struct x86_legacy_features legacy;
    void (*set_legacy_features)(void);
};
This data structure manages other timing operations too, such as
getting time from the RTC through get_wallclock() or setting time
on the RTC through the set_wallclock() callback.
Programmable interval timer (PIT)
There are certain tasks that need to be carried out by the kernel at regular
intervals, such as:
Updating the current time and date (at midnight)
Updating the system running time (uptime)
Keeping track of the time consumed by each process so that they don't
exceed the time allotted to run on the CPU
Keeping track of various timer activities
In order to carry out these tasks, interrupts must be periodically raised. Every
time this periodic interrupt is raised, the kernel knows it's time to update the
aforementioned timing data. The PIT is the piece of hardware responsible for
issuing this periodic interrupt, called the timer interrupt. The PIT keeps on issuing
timer interrupts on IRQ0 periodically at the configured frequency; at 1000 Hz, for
example, an interrupt arrives once every millisecond. This periodic interrupt is
called the tick and the frequency at which it's issued is called the tick rate. The
tick rate is defined by the kernel macro HZ and is measured in hertz.
System responsiveness depends on the tick rate: the shorter the ticks, the more
responsive a system would be, and vice versa. With shorter ticks, poll() and
select() system calls will have a faster response time. However, the considerable
drawback of a shorter tick (that is, a higher tick rate) is that the CPU spends more
of its time in kernel mode (executing the interrupt handler for the timer interrupt),
leaving less time for user-mode code (programs) to execute on it. In a
high-performance CPU, it wouldn't be much of an overhead, but in slower CPUs, the
overall system performance would be affected considerably.
To reach a balance between response time and system performance, a tick rate of
100 Hz is used in most machines. Except for Alpha and m68knommu, which use
a 1000 Hz tick rate, the rest of the common architectures, including x86, arm,
powerpc, sparc, and mips, use a 100 Hz tick rate. Common PIT hardware
found in x86 machines is the Intel 8253. It's I/O mapped and accessed through
addresses 0x40 to 0x43. The PIT is initialized by setup_pit_timer(), defined in the
arch/x86/kernel/i8253.c file:
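The trade-off can be made concrete with a little arithmetic. The sketch below is a user-space illustration, not kernel code: the helper names are hypothetical, and it assumes that a timeout is rounded up to a whole number of ticks. It computes the tick period for a given tick rate and shows how coarsely a millisecond timeout is served:

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ull
#define MSEC_PER_SEC 1000ull

/* Length of one tick, in nanoseconds, for a tick rate of hz. */
static uint64_t tick_period_ns(unsigned int hz)
{
    return NSEC_PER_SEC / hz;
}

/* Number of whole ticks needed to cover a timeout of ms milliseconds. */
static uint64_t timeout_to_ticks(unsigned int hz, unsigned int ms)
{
    return ((uint64_t)ms * hz + MSEC_PER_SEC - 1) / MSEC_PER_SEC;
}

/* Time actually covered by that many ticks, in milliseconds. */
static uint64_t ticks_to_ms(unsigned int hz, uint64_t ticks)
{
    return ticks * MSEC_PER_SEC / hz;
}
```

Under these assumptions, a 15 ms timeout takes 2 ticks (20 ms) at 100 Hz but 15 ticks (15 ms) at 1000 Hz, at the price of ten times as many timer interrupts per second.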
void __init setup_pit_timer(void)
{
    clockevent_i8253_init(true);
    global_clock_event = &i8253_clockevent;
}
This calls clockevent_i8253_init() internally, defined in drivers/clocksource/i8253.c:
void __init clockevent_i8253_init(bool oneshot)
{
    if (oneshot)
        i8253_clockevent.features |= CLOCK_EVT_FEAT_ONESHOT;
    /*
     * Start pit with the boot cpu mask. x86 might make it global
     * when it is used as broadcast device later.
     */
    i8253_clockevent.cpumask = cpumask_of(smp_processor_id());
    clockevents_config_and_register(&i8253_clockevent, PIT_TICK_RATE,
                                    0xF, 0x7FFF);
}
CPU local timer
The PIT is a global timer, and interrupts raised by it can be handled by any CPU
in an SMP system. In some cases, having such a common timer is beneficial,
whereas in other cases, a per-CPU timer is more desirable. In an SMP system,
keeping process time and monitoring the time slices allotted to a process on each
CPU would be much easier and more efficient with a local timer.
The local APIC in recent x86 microprocessors embeds such a CPU local timer. A
CPU local timer can issue interrupts either once or periodically. It uses a 32-bit
counter and can issue interrupts at a very low frequency (this wider counter allows
more ticks to occur before an interrupt is raised). The APIC timer is quite similar
to the PIT except that it's local to the CPU, has a 32-bit counter (the PIT has a
16-bit one), and works with the bus clock signal (the PIT uses its own clock
signal).
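The effect of the counter width can be estimated with simple arithmetic: a counter of n bits clocked at f Hz can count at most 2^n cycles before it has to raise an interrupt. The following sketch is illustrative only. The function name is hypothetical, the PIT input frequency of 1193182 Hz is the value of the kernel's PIT_TICK_RATE constant, and the 100 MHz bus clock used for the APIC timer is purely an assumed figure:

```c
#include <stdint.h>

/*
 * Longest interval, in microseconds, that a down-counter of the given
 * width can cover when clocked at freq_hz. Valid for widths up to 32 bits.
 */
static uint64_t counter_max_interval_us(unsigned int bits, uint64_t freq_hz)
{
    uint64_t counts = 1ull << bits;   /* number of distinct counter values */

    return counts * 1000000ull / freq_hz;
}
```

With these numbers, the 16-bit PIT covers about 54.9 ms at most, while a 32-bit counter on the assumed 100 MHz clock covers roughly 42.9 seconds.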
High-precision event timer (HPET)
The HPET works with clock signals in excess of 10 MHz, which corresponds to a
tick at least once every 100 nanoseconds, hence the name high-precision. The
HPET implements a 64-bit main counter to count at such a high frequency. It was
co-developed by Intel and Microsoft for the need of a new high-resolution timer.
The HPET embeds a collection of timers. Each of them is capable of issuing
interrupts independently, and can be used by specific applications as assigned by
the kernel. These timers are managed as groups of timers, where each group can
have a maximum of 32 timers in it. An HPET can implement a maximum of 8 such
groups. Each timer has a comparator and a match register. A timer issues an
interrupt when the value in its match register matches the value of the main
counter. Timers can be programmed to generate interrupts either once or
periodically.
Registers are memory mapped and have relocatable address space. During
system bootup, the BIOS sets up the registers' address space and passes it to the
kernel. Once the BIOS maps the address, it's seldom remapped by the kernel.
ACPI power management timer (ACPI PMT)
The ACPI PMT is a simple counter that has a fixed frequency clock at 3.58 MHz.
It increments on each tick. The PMT is port mapped; the BIOS takes care of
address mapping in the hardware initialization phase during bootup. The PMT is
more reliable than the TSC, as it works with a constant clock frequency. The
TSC depends on the CPU clock, which can be underclocked or overclocked as
per the current load, resulting in time dilation and inaccurate measurements.
Of these, the HPET, if present in the system, is preferable since it allows very
short time intervals.
struct clocksource {
    u64 (*read)(struct clocksource *cs);
    u64 mask;
    u32 mult;
    u32 shift;
    u64 max_idle_ns;
    u32 maxadj;
#ifdef CONFIG_ARCH_CLOCKSOURCE_DATA
    struct arch_clocksource_data archdata;
#endif
    u64 max_cycles;
    const char *name;
    struct list_head list;
    int rating;
    int (*enable)(struct clocksource *cs);
    void (*disable)(struct clocksource *cs);
    unsigned long flags;
    void (*suspend)(struct clocksource *cs);
    void (*resume)(struct clocksource *cs);
    void (*mark_unstable)(struct clocksource *cs);
    void (*tick_stable)(struct clocksource *cs);
    /* private: */
#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
    /* Watchdog related data, used by the framework */
    struct list_head wd_list;
    u64 cs_last;
    u64 wd_last;
#endif
    struct module *owner;
};
Members mult and shift are useful in obtaining elapsed time in
relevant units.
Calculating elapsed time
Until this point we know that in every system there is a free-running,
ever-incrementing counter, and all time is derived from it, be it wall time or any
duration. The most natural idea here to calculate the time (seconds elapsed since
the start of the counter) would be dividing the number of cycles provided by this
counter by the clock frequency, as expressed in the following formula:
Time (seconds) = (counter value) / (clock frequency)
There is a catch with this approach, however: it involves division (which works
on an iterative algorithm, making it the slowest among the four basic arithmetic
operations) and floating point calculations, which might be slower on certain
architectures. While working with embedded platforms, floating point
calculations are evidently slower than they are on PC or server platforms.
So how do we overcome this issue? Instead of division, time is calculated using
multiplication and bitwise shift operations. The kernel provides a helper routine
that derives the time this way. clocksource_cyc2ns(), defined in
include/linux/clocksource.h, converts the clocksource cycles to nanoseconds:
static inline s64 clocksource_cyc2ns(u64 cycles, u32 mult, u32 shift)
{
    return ((u64) cycles * mult) >> shift;
}
Here, the parameter cycles is the number of elapsed cycles from the clock
source, mult is the cycle-to-nanosecond multiplier, while shift is the
cycle-to-nanosecond divisor (power of two). Both these parameters are clock source
dependent. These values are provided by the clock source kernel abstraction
discussed earlier.
Clock source hardware is not accurate all the time; its frequency might vary.
This clock variation causes time drift (making the clock run faster or slower). In
such cases, the variable mult can be adjusted to make up for this time drift.
The helper routine clocks_calc_mult_shift(), defined in kernel/time/clocksource.c,
helps evaluate mult and shift factors:
void
clocks_calc_mult_shift(u32 *mult, u32 *shift, u32 from, u32 to, u32 maxsec)
{
    u64 tmp;
    u32 sft, sftacc = 32;

    /*
     * Calculate the shift factor which is limiting the conversion
     * range:
     */
    tmp = ((u64)maxsec * from) >> 32;
    while (tmp) {
        tmp >>= 1;
        sftacc--;
    }

    /*
     * Find the conversion shift/mult pair which has the best
     * accuracy and fits the maxsec conversion range:
     */
    for (sft = 32; sft > 0; sft--) {
        tmp = (u64) to << sft;
        tmp += from / 2;
        do_div(tmp, from);
        if ((tmp >> sftacc) == 0)
            break;
    }
    *mult = tmp;
    *shift = sft;
}
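To see the two helpers working together, the following is a user-space sketch that mirrors the logic of clocks_calc_mult_shift() and clocksource_cyc2ns() using standard C types. The function names cyc2ns() and calc_mult_shift() are stand-ins chosen for this illustration, plain 64-bit division replaces do_div(), and the 1 MHz clock in the usage note is an assumed example rather than a real clock source:

```c
#include <stdint.h>

/* Convert cycles to nanoseconds with a multiply and a shift, no division. */
static int64_t cyc2ns(uint64_t cycles, uint32_t mult, uint32_t shift)
{
    return (int64_t)((cycles * mult) >> shift);
}

/*
 * Find a mult/shift pair that converts a counter running at 'from' Hz
 * into units of 'to' per second, valid for at least 'maxsec' seconds.
 */
static void calc_mult_shift(uint32_t *mult, uint32_t *shift,
                            uint32_t from, uint32_t to, uint32_t maxsec)
{
    uint64_t tmp;
    uint32_t sft, sftacc = 32;

    /* Reduce the accepted mult width so that maxsec worth of cycles fits. */
    tmp = ((uint64_t)maxsec * from) >> 32;
    while (tmp) {
        tmp >>= 1;
        sftacc--;
    }

    /* Pick the largest shift whose (rounded) mult still fits in sftacc bits. */
    for (sft = 32; sft > 0; sft--) {
        tmp = (uint64_t)to << sft;
        tmp += from / 2;
        tmp /= from;
        if ((tmp >> sftacc) == 0)
            break;
    }
    *mult = (uint32_t)tmp;
    *shift = sft;
}
```

For a 1 MHz counter converted to nanoseconds (from = 1000000, to = 1000000000, maxsec = 1), this yields mult = 4194304000 and shift = 22, so one cycle maps to (1 * 4194304000) >> 22 = 1000 ns, exactly what the division 1 / 1 MHz would give.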
Time duration between two events can be calculated as shown in the following
code snippet:
struct clocksource *cs = &curr_clocksource;
cycle_t start = cs->read(cs);
/* things to do */
cycle_t end = cs->read(cs);
/* mask the difference so that a counter wraparound is handled */
cycle_t diff = (end - start) & cs->mask;
s64 duration = clocksource_cyc2ns(diff, cs->mult, cs->shift);
Linux timekeeping data structures, macros, and helper routines
We will now broaden our awareness by looking at some key timekeeping
structures, macros, and helper routines that can assist programmers in extracting
specific time-related data.
u64 get_jiffies_64(void)
{
    unsigned long seq;
    u64 ret;

    do {
        seq = read_seqbegin(&jiffies_lock);
        ret = jiffies_64;
    } while (read_seqretry(&jiffies_lock, seq));
    return ret;
}
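The retry loop in get_jiffies_64() follows the sequence-lock pattern: a reader samples a sequence counter, reads the data, and retries if a writer was active in between. The sketch below imitates that pattern in user space with C11 atomics. It is an illustration of the idea rather than the kernel's seqlock implementation; the structure and function names are hypothetical, a single writer is assumed, and the 64-bit value is deliberately split into two 32-bit halves to model a machine that cannot read 64 bits atomically:

```c
#include <stdatomic.h>
#include <stdint.h>

struct jiffies64_sim {
    atomic_uint seq;        /* even: stable, odd: update in progress */
    _Atomic uint32_t lo;    /* low half of the 64-bit counter */
    _Atomic uint32_t hi;    /* high half of the 64-bit counter */
};

/* Writer side: bump seq to odd, update both halves, bump seq back to even. */
static void sim_write(struct jiffies64_sim *s, uint64_t v)
{
    atomic_fetch_add(&s->seq, 1);
    atomic_store(&s->lo, (uint32_t)v);
    atomic_store(&s->hi, (uint32_t)(v >> 32));
    atomic_fetch_add(&s->seq, 1);
}

/* Reader side: retry until both halves were read within one stable window. */
static uint64_t sim_read(struct jiffies64_sim *s)
{
    unsigned int start;
    uint32_t lo, hi;

    do {
        start = atomic_load(&s->seq);
        lo = atomic_load(&s->lo);
        hi = atomic_load(&s->hi);
    } while ((start & 1u) || atomic_load(&s->seq) != start);

    return ((uint64_t)hi << 32) | lo;
}
```

The reader never blocks the writer; it simply repeats the read when it detects that the two halves may belong to different updates.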
#define time_after(a,b) \
    (typecheck(unsigned long, a) && \
     typecheck(unsigned long, b) && \
     ((long)((b) - (a)) < 0))
#define time_before(a,b) time_after(b,a)
#define time_after_eq(a,b) \
    (typecheck(unsigned long, a) && \
     typecheck(unsigned long, b) && \
     ((long)((a) - (b)) >= 0))
#define time_before_eq(a,b) time_after_eq(b,a)
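These macros exist because a tick counter held in an unsigned long eventually wraps around, at which point a plain comparison gives the wrong answer. The following user-space sketch reproduces the signed-difference trick without the kernel's typecheck() helper; deadline_passed() is a hypothetical wrapper added for illustration, and the sketch assumes the usual two's complement conversion from unsigned long to long:

```c
#include <limits.h>

/* Same signed-difference comparison as the kernel macros, minus typecheck(). */
#define time_after(a, b)   ((long)((b) - (a)) < 0)
#define time_before(a, b)  time_after(b, a)

/* Returns nonzero once 'now' has moved past 'deadline', even across a wrap. */
static int deadline_passed(unsigned long now, unsigned long deadline)
{
    return time_after(now, deadline);
}
```

If the counter reads ULONG_MAX - 5 and a deadline is set 10 ticks ahead, the deadline value wraps to 4. A direct comparison would claim the deadline is already behind us, while time_after() still orders the two values correctly.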
unsigned int jiffies_to_msecs(const unsigned long j)
{
#if HZ <= MSEC_PER_SEC && !(MSEC_PER_SEC % HZ)
    return (MSEC_PER_SEC / HZ) * j;
#elif HZ > MSEC_PER_SEC && !(HZ % MSEC_PER_SEC)
    return (j + (HZ / MSEC_PER_SEC) - 1) / (HZ / MSEC_PER_SEC);
#else
# if BITS_PER_LONG == 32
    return (HZ_TO_MSEC_MUL32 * j) >> HZ_TO_MSEC_SHR32;
# else
    return (j * HZ_TO_MSEC_NUM) / HZ_TO_MSEC_DEN;
# endif
#endif
}

unsigned int jiffies_to_usecs(const unsigned long j)
{
    /*
     * Hz doesn't go much further MSEC_PER_SEC.
     * jiffies_to_usecs() and usecs_to_jiffies() depend on that.
     */
    BUILD_BUG_ON(HZ > USEC_PER_SEC);

#if !(USEC_PER_SEC % HZ)
    return (USEC_PER_SEC / HZ) * j;
#else
# if BITS_PER_LONG == 32
    return (HZ_TO_USEC_MUL32 * j) >> HZ_TO_USEC_SHR32;
# else
    return (j * HZ_TO_USEC_NUM) / HZ_TO_USEC_DEN;
# endif
#endif
}

static inline u64 jiffies_to_nsecs(const unsigned long j)
{
    return (u64)jiffies_to_usecs(j) * NSEC_PER_USEC;
}
Other conversion routines can be explored in the
include/linux/jiffies.h file.
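The compile-time branches of jiffies_to_msecs() can be easier to follow when HZ is treated as an ordinary value. The sketch below is a user-space approximation with HZ passed as a runtime parameter; sim_jiffies_to_msecs() is a hypothetical name, and the final fallback uses a plain 64-bit multiply and divide in place of the kernel's precomputed HZ_TO_MSEC_* constants:

```c
#define MSEC_PER_SEC 1000u

/* Convert j ticks to milliseconds for a tick rate of hz (hz must be nonzero). */
static unsigned int sim_jiffies_to_msecs(unsigned long j, unsigned int hz)
{
    /* HZ divides 1000 evenly: each tick is a whole number of milliseconds. */
    if (hz <= MSEC_PER_SEC && (MSEC_PER_SEC % hz) == 0)
        return (unsigned int)((MSEC_PER_SEC / hz) * j);

    /* HZ is a multiple of 1000: several ticks per millisecond, round up. */
    if (hz > MSEC_PER_SEC && (hz % MSEC_PER_SEC) == 0)
        return (unsigned int)((j + (hz / MSEC_PER_SEC) - 1) /
                              (hz / MSEC_PER_SEC));

    /* Anything else: fall back to a generic scaled division. */
    return (unsigned int)(((unsigned long long)j * MSEC_PER_SEC) / hz);
}
```

For instance, 250 ticks correspond to 2500 ms at 100 Hz, 250 ms at 1000 Hz, and 1000 ms at 250 Hz.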
struct timespec {
    __kernel_time_t tv_sec;        /* seconds */
    long tv_nsec;                  /* nanoseconds */
};
struct timeval {
    __kernel_time_t tv_sec;        /* seconds */
    __kernel_suseconds_t tv_usec;  /* microseconds */
};
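The two structures differ only in the resolution of their sub-second field, so converting between them is a matter of scaling. The sketch below uses the user-space definitions of struct timespec and struct timeval from the C library; the helper names are hypothetical and chosen for this example:

```c
#include <stdint.h>
#include <sys/time.h>
#include <time.h>

/* Flatten a timespec into a single nanosecond count. */
static int64_t ts_to_ns(struct timespec ts)
{
    return (int64_t)ts.tv_sec * 1000000000ll + ts.tv_nsec;
}

/* Widen a timeval (microsecond resolution) into a timespec (nanoseconds). */
static struct timespec tv_to_ts(struct timeval tv)
{
    struct timespec ts;

    ts.tv_sec = tv.tv_sec;
    ts.tv_nsec = (long)tv.tv_usec * 1000L;
    return ts;
}
```

A timeval of 2 seconds and 500000 microseconds becomes a timespec of 2 seconds and 500000000 nanoseconds, or 2500000000 ns in total.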
struct tk_read_base {
    struct clocksource *clock;
    cycle_t (*read)(struct clocksource *cs);
    cycle_t mask;
    cycle_t cycle_last;
    u32 mult;
    u32 shift;
    u64 xtime_nsec;
    ktime_t base_mono;
};
struct timekeeper {
    struct tk_read_base tkr;
    u64 xtime_sec;
    unsigned long ktime_sec;
    struct timespec64 wall_to_monotonic;
    ktime_t offs_real;
    ktime_t offs_boot;
    ktime_t offs_tai;
    s32 tai_offset;
    ktime_t base_raw;
    struct timespec64 raw_time;

    /* The following members are for timekeeping internal use */
    cycle_t cycle_interval;
    u64 xtime_interval;
    s64 xtime_remainder;
    u32 raw_interval;
    u64 ntp_tick;
    /* Difference between accumulated time and NTP time in ntp
     * shifted nanoseconds. */
    s64 ntp_error;
    u32 ntp_error_shift;
    u32 ntp_err_mult;
};
static inline u64 timekeeping_delta_to_ns(struct tk_read_base *tkr, u64 delta)
{
    u64 nsec;

    nsec = delta * tkr->mult + tkr->xtime_nsec;
    nsec >>= tkr->shift;

    /* If arch requires, add in get_arch_timeoffset() */
    return nsec + arch_gettimeoffset();
}
static inline u64 timekeeping_get_ns(struct tk_read_base *tkr)
{
    u64 delta;

    delta = timekeeping_get_delta(tkr);
    return timekeeping_delta_to_ns(tkr, delta);
}
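The arithmetic in timekeeping_delta_to_ns() can be tried out in user space. The sketch below is a simplified model, not the kernel code: struct read_base_sim and delta_to_ns() are hypothetical names, the architecture-specific offset is left out, and the delta is assumed to be the masked difference between the current counter value and cycle_last. The mult and shift values used in the usage note are those of an assumed 1 MHz counter:

```c
#include <stdint.h>

/* Cut-down stand-in for the fields of tk_read_base used in the conversion. */
struct read_base_sim {
    uint64_t mask;        /* bitmask covering the valid counter bits */
    uint64_t cycle_last;  /* counter value at the last accumulation */
    uint32_t mult;        /* cycle-to-nanosecond multiplier */
    uint32_t shift;       /* cycle-to-nanosecond shift */
    uint64_t xtime_nsec;  /* leftover nanoseconds, stored shifted left by shift */
};

/* Nanoseconds elapsed since cycle_last, given the current counter value. */
static uint64_t delta_to_ns(const struct read_base_sim *tkr, uint64_t now)
{
    /* Masking makes the subtraction correct across a counter wraparound. */
    uint64_t delta = (now - tkr->cycle_last) & tkr->mask;

    return (delta * tkr->mult + tkr->xtime_nsec) >> tkr->shift;
}
```

With a 32-bit counter (mask 0xFFFFFFFF), cycle_last at 0xFFFFFFF0, and a current reading of 0x10, the masked delta is 32 cycles. Using mult = 4194304000 and shift = 22, that comes to 32000 ns, even though the counter wrapped in between.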
static u64 logarithmic_accumulation(struct timekeeper *tk, u64 offset,
                                    u32 shift, unsigned int *clock_set)
{
    u64 interval = tk->cycle_interval << shift;
    u64 snsec_per_sec;

    /* If the offset is smaller than a shifted interval, do nothing */
    if (offset < interval)
        return offset;

    /* Accumulate one shifted interval */
    offset -= interval;
    tk->tkr_mono.cycle_last += interval;
    tk->tkr_raw.cycle_last += interval;

    tk->tkr_mono.xtime_nsec += tk->xtime_interval << shift;
    *clock_set |= accumulate_nsecs_to_secs(tk);

    /* Accumulate raw time */
    tk->tkr_raw.xtime_nsec += (u64)tk->raw_time.tv_nsec << tk->tkr_raw.shift;
    tk->tkr_raw.xtime_nsec += tk->raw_interval << shift;
    snsec_per_sec = (u64)NSEC_PER_SEC << tk->tkr_raw.shift;
    while (tk->tkr_raw.xtime_nsec >= snsec_per_sec) {
        tk->tkr_raw.xtime_nsec -= snsec_per_sec;
        tk->raw_time.tv_sec++;
    }
    tk->raw_time.tv_nsec = tk->tkr_raw.xtime_nsec >> tk->tkr_raw.shift;
    tk->tkr_raw.xtime_nsec -= (u64)tk->raw_time.tv_nsec << tk->tkr_raw.shift;

    /* Accumulate error between NTP and clock interval */
<spanclass="n">tk</span><spanclass="o">-></span><span
class="n">ntp_error</span><spanclass="o">+=</span><span
class="n">tk</span><spanclass="o">-></span><span
class="n">ntp_tick</span><spanclass="o"><<</span><span
class="n">shift</span><spanclass="p">;</span>
<spanclass="n">tk</span><spanclass="o">-></span><span
class="n">ntp_error</span><spanclass="o">-=</span><span
class="p">(</span><spanclass="n">tk</span><spanclass="o">->
</span><spanclass="n">xtime_interval</span><spanclass="o">+
</span><spanclass="n">tk</span><spanclass="o">-></span><span
class="n">xtime_remainder</span><spanclass="p">)</span><span
class="o"><<</span>
<spanclass="p">(</span><spanclass="n">tk</span><span
class="o">-></span><spanclass="n">ntp_error_shift</span><span
class="o">+</span><spanclass="n">shift</span><spanclass="p">);
</span>
<spanclass="k">return</span><spanclass="n">offset</span>
<spanclass="p">;</span>
<spanclass="p">}</span>
Another routine, update_wall_time(), defined in kernel/time/timekeeping.c, is responsible for maintaining the wall time. It increments the wall time using the current clock source as reference.
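The raw-time part of the accumulation listing keeps the sub-second remainder in a left-shifted fixed-point form and carries whole seconds out in a loop. The user-space sketch below models only that step. The struct raw_clock type and both function names are hypothetical stand-ins for illustration, not kernel definitions.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical, simplified stand-in for the kernel's raw-time state. */
struct raw_clock {
    uint64_t sec;        /* whole seconds accumulated so far */
    uint64_t xtime_nsec; /* sub-second part, stored as nanoseconds << shift */
    uint32_t shift;      /* fixed-point shift of the clock source */
};

/*
 * Add one interval (already expressed in shifted nanoseconds) and carry
 * any whole seconds into rc->sec. Returns the number of seconds carried.
 */
static unsigned int raw_clock_accumulate(struct raw_clock *rc,
                                         uint64_t interval_shifted)
{
    uint64_t snsec_per_sec;
    unsigned int carried = 0;

    assert(rc->shift < 32); /* keeps NSEC_PER_SEC << shift inside 64 bits */
    snsec_per_sec = NSEC_PER_SEC << rc->shift;

    rc->xtime_nsec += interval_shifted;
    while (rc->xtime_nsec >= snsec_per_sec) {
        rc->xtime_nsec -= snsec_per_sec;
        rc->sec++;
        carried++;
    }
    return carried;
}

/* Plain nanoseconds within the current second (fraction bits dropped). */
static uint64_t raw_clock_nsec(const struct raw_clock *rc)
{
    return rc->xtime_nsec >> rc->shift;
}
```

Keeping the value shifted preserves the fractional nanoseconds that would otherwise be lost on every accumulation step; the shift is undone only when a plain nanosecond value is needed.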
struct clock_event_device {
    void (*event_handler)(struct clock_event_device *);
    int (*set_next_event)(unsigned long evt, struct clock_event_device *);
    int (*set_next_ktime)(ktime_t expires, struct clock_event_device *);
    ktime_t next_event;
    u64 max_delta_ns;
    u64 min_delta_ns;
    u32 mult;
    u32 shift;
    enum clock_event_state state_use_accessors;
    unsigned int features;
    unsigned long retries;

    int (*set_state_periodic)(struct clock_event_device *);
    int (*set_state_oneshot)(struct clock_event_device *);
    int (*set_state_oneshot_stopped)(struct clock_event_device *);
    int (*set_state_shutdown)(struct clock_event_device *);
    int (*tick_resume)(struct clock_event_device *);

    void (*broadcast)(const struct cpumask *mask);
    void (*suspend)(struct clock_event_device *);
    void (*resume)(struct clock_event_device *);
    unsigned long min_delta_ticks;
    unsigned long max_delta_ticks;

    const char *name;
    int rating;
    int irq;
    int bound_on;
    const struct cpumask *cpumask;
    struct list_head list;
    struct module *owner;
} ____cacheline_aligned;
#define CLOCK_EVT_FEAT_PERIODIC 0x000001
#define CLOCK_EVT_FEAT_ONESHOT  0x000002
#define CLOCK_EVT_FEAT_KTIME    0x000004
Periodic mode configures the hardware to generate the tick once every 1/HZ seconds, while one-shot mode makes the hardware generate the tick after the passage of a specific number of cycles from the current time.
Depending on the use case and the operating mode, event_handler could be any of these three routines:
tick_handle_periodic() is the default handler for periodic ticks and is defined in kernel/time/tick-common.c.
tick_nohz_handler() is the low-resolution interrupt handler, used in low-res mode. It's defined in kernel/time/tick-sched.c.
hrtimer_interrupt() is used in high-res mode and is defined in kernel/time/hrtimer.c. Interrupts are disabled when it's called.
A clock event device is configured and registered through the routine clockevents_config_and_register(), defined in kernel/time/clockevents.c.
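The mult, shift, min_delta_ns, and max_delta_ns fields of struct clock_event_device hint at how a one-shot event is programmed: the requested delay in nanoseconds is clamped to the device limits and then scaled into device cycles. The following user-space sketch illustrates that arithmetic under simplifying assumptions. The struct evt_model type and evt_ns_to_cycles() are hypothetical names, and the real kernel path does more (for example, error handling and retries).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of struct clock_event_device, for illustration only. */
struct evt_model {
    uint64_t min_delta_ns; /* shortest delay the device can be programmed for */
    uint64_t max_delta_ns; /* longest delay the device can be programmed for */
    uint32_t mult;         /* cycles-per-nanosecond factor, scaled by 2^shift */
    uint32_t shift;
};

/*
 * Clamp the requested delay to the device limits, then scale it:
 * cycles = (ns * mult) >> shift.
 * The multiplication can overflow 64 bits for very large products; this
 * sketch assumes max_delta_ns and mult are small enough to avoid that.
 */
static uint64_t evt_ns_to_cycles(const struct evt_model *dev, uint64_t delta_ns)
{
    assert(dev->min_delta_ns <= dev->max_delta_ns);

    if (delta_ns < dev->min_delta_ns)
        delta_ns = dev->min_delta_ns;
    if (delta_ns > dev->max_delta_ns)
        delta_ns = dev->max_delta_ns;

    return (delta_ns * dev->mult) >> dev->shift;
}
```

For a device ticking at 500 MHz, mult = 512 with shift = 10 gives exactly 0.5 cycles per nanosecond, so a 2,000 ns delay maps to 1,000 cycles.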
enum tick_device_mode {
    TICKDEV_MODE_PERIODIC,
    TICKDEV_MODE_ONESHOT,
};

struct tick_device {
    struct clock_event_device *evtdev;
    enum tick_device_mode mode;
};
A tick_device could be either periodic or one-shot. Its mode is set through the enum tick_device_mode.
Software timers and delay functions
A software timer allows a function to be invoked on expiry of a time duration. There are two types of timers: dynamic timers used by the kernel and interval timers used by user-space processes. Apart from software timers, there is another commonly used type of timing function called delay functions. Delay functions implement a tight busy-wait loop whose length is determined by the delay function's argument (usually the loop runs a number of iterations proportional to it).
struct timer_list {
    /*
     * Every field that changes during normal runtime grouped to the
     * same cacheline
     */
    struct hlist_node entry;
    unsigned long expires;
    void (*function)(unsigned long);
    unsigned long data;
    u32 flags;

#ifdef CONFIG_LOCKDEP
    struct lockdep_map lockdep_map;
#endif
};
int del_timer(struct timer_list *timer)
{
    struct tvec_base *base;
    unsigned long flags;
    int ret = 0;

    debug_assert_init(timer);

    timer_stats_timer_clear_start_info(timer);
    if (timer_pending(timer)) {
        base = lock_timer_base(timer, &flags);
        if (timer_pending(timer)) {
            detach_timer(timer, 1);
            if (timer->expires == base->next_timer &&
                !tbase_get_deferrable(timer->base))
                base->next_timer = base->timer_jiffies;
            ret = 1;
        }
        spin_unlock_irqrestore(&base->lock, flags);
    }

    return ret;
}

int del_timer_sync(struct timer_list *timer)
{
#ifdef CONFIG_LOCKDEP
    unsigned long flags;

    /*
     * If lockdep gives a backtrace here, please reference
     * the synchronization rules above.
     */
    local_irq_save(flags);
    lock_map_acquire(&timer->lockdep_map);
    lock_map_release(&timer->lockdep_map);
    local_irq_restore(flags);
#endif
    /*
     * don't use it in hardirq context, because it
     * could lead to deadlock.
     */
    WARN_ON(in_irq());
    for (;;) {
        int ret = try_to_del_timer_sync(timer);
        if (ret >= 0)
            return ret;
        cpu_relax();
    }
}

#define del_singleshot_timer_sync(t) del_timer_sync(t)
del_timer() removes both active and inactive timers. Particularly useful in SMP systems, del_timer_sync() deactivates the timer and waits until the handler has finished executing on other CPUs.
...
del_timer(&t_obj);
RESOURCE_DEALLOCATE();
....
This approach, however, is applicable to uniprocessor systems only. In an SMP system, it's quite possible that when the timer is stopped, its function might already be running on another CPU. In such a scenario, resources will be released as soon as del_timer() returns, while the timer function is still manipulating them on the other CPU; not a desirable situation at all. del_timer_sync() fixes this problem: after stopping the timer, it waits until the timer function completes its execution on the other CPU. del_timer_sync() is useful in cases where the timer function can reactivate itself. If the timer function doesn't reactivate the timer, the macro del_singleshot_timer_sync() can be used instead; note that in the kernel version listed here it is simply defined as del_timer_sync(), so the two behave identically.
static __latent_entropy void run_timer_softirq(struct softirq_action *h)
{
    struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);

    base->must_forward_clk = false;

    __run_timers(base);
    if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && base->nohz_active)
        __run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));
}
static inline void ndelay(unsigned long x)
{
    udelay(DIV_ROUND_UP(x, 1000));
}

static void ia64_itc_udelay(unsigned long usecs)
{
    unsigned long start = ia64_get_itc();
    unsigned long end = start + usecs * local_cpu_data->cyc_per_usec;

    while (time_before(ia64_get_itc(), end))
        cpu_relax();
}

void (*ia64_udelay)(unsigned long usecs) = &ia64_itc_udelay;

void udelay(unsigned long usecs)
{
    (*ia64_udelay)(usecs);
}
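The delay listings above combine two ideas: ndelay() rounds a nanosecond request up to whole microseconds, and the udelay() implementation spins until a free-running counter passes a precomputed end value. The user-space sketch below imitates both, using clock_gettime(CLOCK_MONOTONIC) in place of the CPU cycle counter. The names COUNTER_BEFORE, monotonic_ns(), spin_udelay(), and spin_ndelay() are hypothetical, and the comparison macro is only an approximation of the kernel's time_before().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Same rounding idiom as the kernel's DIV_ROUND_UP(). */
#define DIV_ROUND_UP(n, d) (((n) + (d) - 1) / (d))

/* Wraparound-safe "a is before b" for free-running 64-bit counters. */
#define COUNTER_BEFORE(a, b) ((int64_t)((a) - (b)) < 0)

/* Current CLOCK_MONOTONIC reading in nanoseconds. */
static uint64_t monotonic_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

    assert(rc == 0);
    (void)rc;
    return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Busy-wait for at least usecs microseconds; returns the elapsed ns. */
static uint64_t spin_udelay(unsigned long usecs)
{
    uint64_t start = monotonic_ns();
    uint64_t end = start + (uint64_t)usecs * 1000ULL;

    while (COUNTER_BEFORE(monotonic_ns(), end))
        ; /* the kernel version calls cpu_relax() here */
    return monotonic_ns() - start;
}

/* Nanosecond delay built on the microsecond one, rounding up. */
static uint64_t spin_ndelay(unsigned long nsecs)
{
    return spin_udelay(DIV_ROUND_UP(nsecs, 1000UL));
}
```

Rounding up means a request for 1,500 ns is served as a 2 µs delay: a delay function may overshoot, but it should never return early.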
POSIX clocks
POSIX provides software timers to multithreaded and real-time user-space applications, known as POSIX timers. POSIX provides the following clocks:
CLOCK_REALTIME: This clock represents the real time in the system. Also known as the wall time, it's similar to the time from a wall clock and is used for timestamping as well as providing the actual time to the user. This clock is modifiable.
CLOCK_MONOTONIC: This clock keeps the time elapsed since system bootup. It's ever increasing and cannot be modified by any process or user. Due to its monotonic nature, it's the preferred clock for determining the time difference between two events.
CLOCK_BOOTTIME: This clock is identical to CLOCK_MONOTONIC; however, it includes time spent in suspend.
These clocks can be accessed and modified (if the selected clock allows it) through the following POSIX clock routines, defined in the time.h header:
int clock_getres(clockid_t clk_id, struct timespec *res);
int clock_gettime(clockid_t clk_id, struct timespec *tp);
int clock_settime(clockid_t clk_id, const struct timespec *tp);
The function clock_getres() gets the resolution (precision) of the clock specified by clk_id. If res is non-NULL, the resolution is stored in the struct timespec pointed to by res. The functions clock_gettime() and clock_settime() read and set the time of the clock specified by clk_id. clk_id could be any of the POSIX clocks: CLOCK_REALTIME, CLOCK_MONOTONIC, and so on.
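As a brief illustration of these routines, the following sketch wraps clock_getres() and clock_gettime() to report values in nanoseconds and measures an interval on CLOCK_MONOTONIC. The helper names are hypothetical and chosen only for this example.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Resolution of a clock in nanoseconds, or -1 if the call fails. */
static int64_t clock_resolution_ns(clockid_t clk_id)
{
    struct timespec res;

    if (clock_getres(clk_id, &res) != 0)
        return -1;
    return (int64_t)res.tv_sec * 1000000000LL + res.tv_nsec;
}

/* Current reading of a clock in nanoseconds, or -1 if the call fails. */
static int64_t clock_now_ns(clockid_t clk_id)
{
    struct timespec tp;

    if (clock_gettime(clk_id, &tp) != 0)
        return -1;
    return (int64_t)tp.tv_sec * 1000000000LL + tp.tv_nsec;
}

/* Nanoseconds elapsed on CLOCK_MONOTONIC since an earlier reading. */
static int64_t monotonic_elapsed_ns(int64_t start_ns)
{
    int64_t now = clock_now_ns(CLOCK_MONOTONIC);

    assert(now >= start_ns); /* CLOCK_MONOTONIC never goes backwards */
    return now - start_ns;
}
```

CLOCK_MONOTONIC is used for the interval because a clock_settime() call on CLOCK_REALTIME between the two readings could make the difference meaningless.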
Each of these POSIX routines has a corresponding system call, namely sys_clock_getres(), sys_clock_gettime(), and sys_clock_settime(). So every time any of these routines is invoked, a switch from user mode to kernel mode occurs. If calls to these routines are frequent, these mode switches can degrade system performance. To avoid them, two coarse variants of the POSIX clocks were implemented in the vDSO (virtual Dynamic Shared Object) library:
CLOCK_REALTIME_COARSE
CLOCK_MONOTONIC_COARSE
vDSO is a small shared library with selected kernel-space routines that the kernel maps into the address space of user-space applications, so that these routines can be called in-process, directly from user space. The C library calls the vDSO, so user-space applications can be programmed in the usual way through standard functions, and the C library will use the functionality available through the vDSO without engaging the syscall interface, thus avoiding user mode to kernel mode switches and syscall overhead. Being vDSO implementations, these coarse variants are faster, at the cost of a coarser resolution of one tick period (on the order of 1 millisecond, depending on HZ).
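The trade-off can be observed from user space by comparing the resolution reported for a coarse clock with that of its precise counterpart. The sketch below assumes a Linux system with a C library that exposes CLOCK_MONOTONIC_COARSE; the helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <time.h>

/* Convert a timespec to a single nanosecond count. */
static long long ts_to_ns(const struct timespec *ts)
{
    return (long long)ts->tv_sec * 1000000000LL + ts->tv_nsec;
}

/*
 * Returns 1 if CLOCK_MONOTONIC_COARSE reports a resolution at least as
 * large as CLOCK_MONOTONIC, 0 if it is finer, and -1 on error.
 */
static int coarse_not_finer(void)
{
    struct timespec fine, coarse;

    if (clock_getres(CLOCK_MONOTONIC, &fine) != 0 ||
        clock_getres(CLOCK_MONOTONIC_COARSE, &coarse) != 0)
        return -1;
    return ts_to_ns(&coarse) >= ts_to_ns(&fine);
}

/* Cheap timestamp for cases where tick-level precision is enough. */
static long long coarse_now_ns(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC_COARSE, &ts);

    assert(rc == 0);
    (void)rc;
    return ts_to_ns(&ts);
}
```

A coarse clock suits high-frequency uses such as log timestamps, where reading cost matters more than sub-millisecond precision.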
Summary
In this chapter, we looked in detail at most of the routines that the kernel provides to drive time-based events, in addition to comprehending the fundamental aspects of Linux time, its infrastructure, and its measurement. We also briefly looked at POSIX clocks and some of their key time access and modification routines. Effective time-driven programs, however, rest on careful and calculated use of these routines.
In the next chapter, we will briefly look at the management of dynamic kernel modules.
Module Management
Kernel modules (also referred to as LKMs) have accelerated the development of kernel services owing to their ease of use. Our focus through this chapter will be to understand how the kernel seamlessly facilitates this entire process, making loading and unloading of modules dynamic and easy, as we look through all the core concepts, functions, and important data structures involved in module management. We assume readers are familiar with the basic usage of modules.
In this chapter, we will cover the following topics:
Key elements of a kernel module
Module layout
Module load and unload interfaces
Key data structures
Kernel modules
A kernel module is an easy and effective mechanism to extend the functionality of a running system without the baggage of rebuilding the whole kernel. Modules have been vital in bringing dynamism and scalability to the Linux operating system. Kernel modules not only satisfy the extensible nature of the kernel but also bring the following functionalities:
Allowing the kernel to keep only the features that are necessary, in turn boosting capacity utilization
Allowing proprietary/non-GPL-compliant services to load and unload
The bottom-line feature of extensibility of the kernel
Elements of an LKM
Each module object comprises the init (constructor) and exit (destructor) routines. The init routine is invoked when a module is deployed into the kernel address space, and the exit routine is called while the module is being removed. As the name suggests, the init routine is usually programmed to carry out operations that are essential to set up the module body, such as registering with a specific kernel subsystem or allocating resources that are essential for the functionality being loaded. However, the specific operations programmed within the init and exit routines depend on what the module is designed for and the functionality it brings to the kernel. The following code excerpt shows a template of the init and exit routines:
int init_module(void)
{
    /* perform required setup and registration ops */
    ...
    ...
    return 0;
}

void cleanup_module(void)
{
    /* perform required cleanup operations */
    ...
    ...
}
Notice that the init routine returns an integer: zero is returned if the module is committed to the kernel address space, and a negative number is returned if it fails. This additionally gives programmers the flexibility to commit a module only when it succeeds in registering with the required subsystem.
The default names for the init and exit routines are init_module() and cleanup_module(), respectively. Modules can optionally change the names of the init and exit routines to improve code readability. However, they will have to declare them using the module_init and module_exit macros:
int myinit(void)
{
    ...
    ...
    return 0;
}

void myexit(void)
{
    ...
    ...
}

module_init(myinit);
module_exit(myexit);
Comment macros form another key element of module code. These macros are used to provide usage, licence, and author information for the module. This is important as modules are sourced from various vendors:
MODULE_DESCRIPTION(): This macro is used to specify a general description of the module
MODULE_AUTHOR(): This is used to provide author information
MODULE_LICENSE(): This is used to specify the legal licence for the code in the module
All the information specified through these macros is retained in the module binary and can be accessed by users through a utility called modinfo. MODULE_LICENSE() is the only mandatory macro that a module must mention. This serves a very handy purpose, as it informs users about proprietary code in a module, which is susceptible to debugging and support issues (the kernel community in all probability ignores issues arising out of proprietary modules).
Another useful feature available for modules is dynamic initialization of module data variables using module parameters. This allows data variables declared in a module to be initialized either during module deployment or when the module is live in memory (through the sysfs interface). This can be achieved by setting up selected variables as module parameters through the appropriate module_param() family of macros (found in the kernel header <linux/moduleparam.h>). Values passed to module parameters during deployment of the module are initialized before the init function is invoked.
Code in modules can access global kernel functions and data as needed. This enables the code of the module to make use of existing kernel functionality. It is through such function calls that a module can perform required operations such as printing messages into the kernel log buffer, allocating and deallocating memory, acquiring and releasing exclusion locks, and registering and unregistering module code with the appropriate subsystem.
Similarly, a module can also export its symbols into the global symbol table of the kernel, which can then be accessed from code in other modules. This facilitates granular design and implementation of kernel services by organizing them across a set of modules, instead of having the whole service implemented as a single LKM. Such stacking up of related services leads to module dependency. For instance, if module A uses the symbols of module B, then A has a dependency on B; in that case, module B must be loaded before module A, and module B cannot be unloaded until module A is unloaded.
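The load and unload ordering rule can be modelled with a simple use count. The following user-space sketch is a toy model only: struct mod_model, mod_load(), and mod_unload() are hypothetical, each module has at most one dependency, and the error codes are chosen for illustration rather than taken from the kernel's loader.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical model of a loaded module; not the kernel's struct module. */
struct mod_model {
    const char *name;
    int loaded;
    int users;             /* loaded modules currently using our symbols */
    struct mod_model *dep; /* at most one dependency, to keep the sketch small */
};

/* Load a module; its dependency (if any) must already be loaded. */
static int mod_load(struct mod_model *m)
{
    assert(m != NULL);
    if (m->loaded)
        return -EEXIST;
    if (m->dep) {
        if (!m->dep->loaded)
            return -ENOENT; /* dependency must be loaded first */
        m->dep->users++;
    }
    m->loaded = 1;
    return 0;
}

/* Unload a module; refused while other modules still depend on it. */
static int mod_unload(struct mod_model *m)
{
    assert(m != NULL);
    if (!m->loaded)
        return -ENOENT;
    if (m->users > 0)
        return -EWOULDBLOCK; /* other modules still depend on us */
    if (m->dep)
        m->dep->users--;
    m->loaded = 0;
    return 0;
}
```

With A depending on B, the only sequence that succeeds is: load B, load A, unload A, unload B.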
Binary layout of an LKM
Modules are built using kbuild makefiles; once the build process completes, an ELF binary file with a .ko (kernel object) extension is generated. Module ELF binaries are appropriately tweaked to add new sections, to differentiate them from other ELF binaries, and to store module-related metadata. The following are the sections in a kernel module:
Section                      Description
.gnu.linkonce.this_module    Module structure
.modinfo                     Information about the module (licenses and so on)
__versions                   Expected versions of symbols that the module depends on during compile time
__ksymtab*                   The table of symbols exported by this module
__kcrctab*                   The table of versions of symbols exported by this module
.init                        Sections used when initializing
.text, .data etc.            Code and data sections
SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
{
    struct load_info info = { };
    loff_t size;
    void *hdr;
    int err;

    err = may_init_module();
    if (err)
        return err;

    pr_debug("finit_module: fd=%d, uargs=%p, flags=%i\n", fd, uargs, flags);

    if (flags & ~(MODULE_INIT_IGNORE_MODVERSIONS
                  |MODULE_INIT_IGNORE_VERMAGIC))
        return -EINVAL;

    err = kernel_read_file_from_fd(fd, &hdr, &size, INT_MAX,
                                   READING_MODULE);
    if (err)
        return err;
    info.hdr = hdr;
    info.len = size;

    return load_module(&info, uargs, flags);
}
static int load_module(struct load_info *info, const char __user *uargs,
                       int flags)
{
    struct module *mod;
    long err;
    char *after_dashes;

    err = module_sig_check(info, flags);
    if (err)
        goto free_copy;

    err = elf_header_check(info);
    if (err)
        goto free_copy;

    /* Figure out module layout, and allocate all the memory. */
    mod = layout_and_allocate(info, flags);
    if (IS_ERR(mod)) {
        err = PTR_ERR(mod);
        goto free_copy;
    }

    ....
    ....
    ....

}
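Two of the early checks in these listings are easy to reproduce in user space: rejecting unknown flag bits, as finit_module() does, and sanity-checking the ELF header of the module image, which is roughly what a routine such as elf_header_check() has to do. The sketch below is a simplified approximation for 64-bit images, not the kernel's code. check_init_flags() and check_module_image() are hypothetical names, and the two flag values are restated here on the assumption that they match the kernel's UAPI header.

```c
#include <assert.h>
#include <elf.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Assumed to match the values in the kernel's UAPI <linux/module.h>. */
#define MODULE_INIT_IGNORE_MODVERSIONS 1
#define MODULE_INIT_IGNORE_VERMAGIC    2

/* Reject any flag bit outside the two that finit_module() accepts. */
static int check_init_flags(int flags)
{
    if (flags & ~(MODULE_INIT_IGNORE_MODVERSIONS |
                  MODULE_INIT_IGNORE_VERMAGIC))
        return -EINVAL;
    return 0;
}

/* Simplified header sanity check for a 64-bit module image in memory. */
static int check_module_image(const void *buf, size_t len)
{
    Elf64_Ehdr hdr;

    assert(buf != NULL);
    if (len < sizeof(hdr))
        return -ENOEXEC;
    memcpy(&hdr, buf, sizeof(hdr)); /* copy out to avoid unaligned access */

    if (memcmp(hdr.e_ident, ELFMAG, SELFMAG) != 0)
        return -ENOEXEC; /* not an ELF file */
    if (hdr.e_type != ET_REL)
        return -ENOEXEC; /* .ko files are relocatable objects */
    if (hdr.e_shentsize != sizeof(Elf64_Shdr))
        return -ENOEXEC; /* unexpected section header size */
    if (hdr.e_shoff >= len ||
        (size_t)hdr.e_shnum * sizeof(Elf64_Shdr) > len - hdr.e_shoff)
        return -ENOEXEC; /* section header table must fit in the image */
    return 0;
}
```

Validating the header before any section is touched matters because every later step of loading indexes into the image using offsets taken from that header.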
SYSCALL_DEFINE2(delete_module, const char __user *, name_user,
                unsigned int, flags)
{
    struct module *mod;
    char name[MODULE_NAME_LEN];
    int ret, forced = 0;

    if (!capable(CAP_SYS_MODULE) || modules_disabled)
        return -EPERM;

    if (strncpy_from_user(name, name_user, MODULE_NAME_LEN-1) < 0)
        return -EFAULT;
    name[MODULE_NAME_LEN-1] = '\0';

    audit_log_kern_module(name);

    if (mutex_lock_interruptible(&module_mutex) != 0)
        return -EINTR;

    mod = find_module(name);
    if (!mod) {
        ret = -ENOENT;
        goto out;
    }

    if (!list_empty(&mod->source_list)) {
        /* Other modules depend on us: get rid of them first. */
        ret = -EWOULDBLOCK;
        goto out;
    }

    /* Doing init or already dying? */
    if (mod->state != MODULE_STATE_LIVE) {
        /* FIXME: if (force), slam module count damn the torpedoes */
        pr_debug("%s already dying\n", mod->name);
        ret = -EBUSY;
        goto out;
    }
<spanclass="cm">/*Ifithasaninitfunc,itmusthaveanexitfunc
tounload*/</span>
<spanclass="k">if</span><spanclass="p">(</span><span
class="n">mod</span><spanclass="o">-></span><span
class="n">init</span><spanclass="o">&&</span><span
class="o">!</span><spanclass="n">mod</span><spanclass="o">->
</span><spanclass="n">exit</span><spanclass="p">)</span><span
class="p">{</span>
<spanclass="n">forced</span><spanclass="o">=</span><span
class="n">try_force_unload</span><spanclass="p">(</span><span
class="n">flags</span><spanclass="p">);</span>
<spanclass="k">if</span><spanclass="p">(</span><span
class="o">!</span><spanclass="n">forced</span><spanclass="p">)
</span><spanclass="p">{</span>
<spanclass="cm">/*Thismodulecan'tberemoved*/</span>
<spanclass="n">ret</span><spanclass="o">=</span><span
class="o">-</span><spanclass="n">EBUSY</span><span
class="p">;</span>
<spanclass="k">goto</span><spanclass="n">out</span><span
class="p">;</span>
<spanclass="p">}</span>
<spanclass="p">}</span>
<spanclass="cm">/*Stopthemachinesorefcountscan'tmoveand
disablemodule.*/</span>
<spanclass="n">ret</span><spanclass="o">=</span><span
class="n">try_stop_module</span><spanclass="p">(</span><span
class="n">mod</span><spanclass="p">,</span><span
class="n">flags</span><spanclass="p">,</span><spanclass="o">&
</span><spanclass="n">forced</span><spanclass="p">);</span>
<spanclass="k">if</span><spanclass="p">(</span><span
class="n">ret</span><spanclass="o">!=</span><span
class="mi">0</span><spanclass="p">)</span>
<spanclass="k">goto</span><spanclass="n">out</span><span
class="p">;</span>
<spanclass="n">mutex_unlock</span><spanclass="p">(</span>
<spanclass="o">&</span><spanclass="n">module_mutex</span>
<spanclass="p">);</span>
<spanclass="cm">/*Finaldestructionnownooneisusingit.
*/</span>
<spanclass="k">if</span><spanclass="p">(</span><span
class="n">mod</span><spanclass="o">-></span><span
class="n">exit</span><spanclass="o">!=</span><span
class="nb">NULL</span><spanclass="p">)</span>
<spanclass="n">mod</span><spanclass="o">-></span><span
class="n">exit</span><spanclass="p">();</span>
<spanclass="n">blocking_notifier_call_chain</span><span
class="p">(</span><spanclass="o">&</span><span
class="n">module_notify_list</span><spanclass="p">,</span>
<spanclass="n">MODULE_STATE_GOING</span><span
class="p">,</span><spanclass="n">mod</span><spanclass="p">);
</span>
<spanclass="n">klp_module_going</span><spanclass="p">
(</span><spanclass="n">mod</span><spanclass="p">);</span>
<spanclass="n">ftrace_release_mod</span><spanclass="p">
(</span><spanclass="n">mod</span><spanclass="p">);</span>
<spanclass="n">async_synchronize_full</span><spanclass="p">
();</span>
<spanclass="cm">/*Storethenameofthelastunloadedmodule
fordiagnosticpurposes*/</span>
<spanclass="n">strlcpy</span><spanclass="p">(</span><span
class="n">last_unloaded_module</span><spanclass="p">,</span>
<spanclass="n">mod</span><spanclass="o">-></span><span
class="n">name</span><spanclass="p">,</span><span
class="k">sizeof</span><spanclass="p">(</span><span
class="n">last_unloaded_module</span><spanclass="p">));</span>
<strong><spanclass="n">free_module</span><spanclass="p">
(</span><spanclass="n">mod</span><spanclass="p">);</span>
</strong>
<spanclass="k">return</span><spanclass="mi">0</span><span
class="p">;</span>
<spanclass="nl">out</span><spanclass="p">:</span>
<spanclass="n">mutex_unlock</span><spanclass="p">(</span>
<spanclass="o">&</span><spanclass="n">module_mutex</span>
<spanclass="p">);</span>
<spanclass="k">return</span><spanclass="n">ret</span><span
class="p">;</span>
<spanclass="p">}</span>
/* Free a module, remove from lists, etc. */
static void free_module(struct module *mod)
{
    trace_module_free(mod);

    mod_sysfs_teardown(mod);

    /* We leave it in list to prevent duplicate loads, but make sure
     * that no one uses it while it's being deconstructed. */
    mutex_lock(&module_mutex);
    mod->state = MODULE_STATE_UNFORMED;
    mutex_unlock(&module_mutex);

    /* Remove dynamic debug info */
    ddebug_remove_module(mod->name);

    /* Arch-specific cleanup. */
    module_arch_cleanup(mod);

    /* Module unload stuff */
    module_unload_free(mod);

    /* Free any allocated parameters. */
    destroy_params(mod->kp, mod->num_kp);

    if (is_livepatch_module(mod))
        free_module_elf(mod);

    /* Now we can delete it from the lists */
    mutex_lock(&module_mutex);
    /* Unlink carefully: kallsyms could be walking list. */
    list_del_rcu(&mod->list);
    mod_tree_remove(mod);
    /* Remove this module from bug list, this uses list_del_rcu */
    module_bug_cleanup(mod);
    /* Wait for RCU-sched synchronizing before releasing mod->list and buglist. */
    synchronize_sched();
    mutex_unlock(&module_mutex);

    /* This may be empty, but that's OK */
    disable_ro_nx(&mod->init_layout);
    module_arch_freeing_init(mod);
    module_memfree(mod->init_layout.base);
    kfree(mod->args);
    percpu_modfree(mod);

    /* Free lock-classes; relies on the preceding sync_rcu(). */
    lockdep_free_key_range(mod->core_layout.base, mod->core_layout.size);

    /* Finally, free the core (containing the module structure) */
    disable_ro_nx(&mod->core_layout);
    module_memfree(mod->core_layout.base);

#ifdef CONFIG_MPU
    update_protections(current->mm);
#endif
}
This call removes the module from the various lists where it was
placed during loading (sysfs, module list, and so on) to initiate the
cleanup. An architecture-specific cleanup routine is invoked (can be
found in </linux/arch/<arch>/kernel/module.c>). All dependent
modules are iterated and the module is removed from their lists. As
soon as the cleanup is over, all resources and the memory that was
allocated to the module are freed.
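Before free_module() is reached, the unload path in the first listing runs a fixed sequence of checks, each with its own error code. The following is a minimal user-space sketch of that ordering, not kernel code: struct sketch_module, its fields, and the boolean force argument are hypothetical stand-ins (in the listing, forcing is decided by try_force_unload(flags), and dependents are tracked in source_list).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins; these are not the kernel's types. */
enum sketch_state {
    SKETCH_LIVE,
    SKETCH_COMING,
    SKETCH_GOING,
    SKETCH_UNFORMED,
};

struct sketch_module {
    enum sketch_state state;
    unsigned int n_dependents; /* stands in for a non-empty source_list */
    bool has_init;             /* module provides an init routine */
    bool has_exit;             /* module provides an exit routine */
};

/*
 * Return 0 when teardown may proceed, or a negative errno that mirrors
 * the order of the checks in the unload listing. A NULL pointer models
 * "no module with that name was found".
 */
int sketch_unload_check(const struct sketch_module *mod, bool force)
{
    if (!mod)
        return -ENOENT;        /* lookup failed */
    if (mod->n_dependents > 0)
        return -EWOULDBLOCK;   /* other modules still depend on it */
    if (mod->state != SKETCH_LIVE)
        return -EBUSY;         /* still initializing or already going */
    if (mod->has_init && !mod->has_exit && !force)
        return -EBUSY;         /* no exit routine and not forced */
    return 0;
}
```

The dependency check comes before the state check, so a module that is both in use and not live reports -EWOULDBLOCK in this sketch, just as the listing would.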
struct module {
    enum module_state state;

    /* Member of list of modules */
    struct list_head list;

    /* Unique handle for this module */
    char name[MODULE_NAME_LEN];

    /* Sysfs stuff. */
    struct module_kobject mkobj;
    struct module_attribute *modinfo_attrs;
    const char *version;
    const char *srcversion;
    struct kobject *holders_dir;

    /* Exported symbols */
    const struct kernel_symbol *syms;
    const s32 *crcs;
    unsigned int num_syms;

    /* Kernel parameters. */
#ifdef CONFIG_SYSFS
    struct mutex param_lock;
#endif
    struct kernel_param *kp;
    unsigned int num_kp;

    /* GPL-only exported symbols. */
    unsigned int num_gpl_syms;
    const struct kernel_symbol *gpl_syms;
    const s32 *gpl_crcs;

#ifdef CONFIG_UNUSED_SYMBOLS
    /* unused exported symbols. */
    const struct kernel_symbol *unused_syms;
    const s32 *unused_crcs;
    unsigned int num_unused_syms;

    /* GPL-only, unused exported symbols. */
    unsigned int num_unused_gpl_syms;
    const struct kernel_symbol *unused_gpl_syms;
    const s32 *unused_gpl_crcs;
#endif

#ifdef CONFIG_MODULE_SIG
    /* Signature was verified. */
    bool sig_ok;
#endif

    bool async_probe_requested;

    /* symbols that will be GPL-only in the near future. */
    const struct kernel_symbol *gpl_future_syms;
    const s32 *gpl_future_crcs;
    unsigned int num_gpl_future_syms;

    /* Exception table */
    unsigned int num_exentries;
    struct exception_table_entry *extable;

    /* Startup function. */
    int (*init)(void);

    /* Core layout: rbtree is accessed frequently, so keep together. */
    struct module_layout core_layout __module_layout_align;
    struct module_layout init_layout;

    /* Arch-specific module values */
    struct mod_arch_specific arch;

    unsigned long taints; /* same bits as kernel:taint_flags */

#ifdef CONFIG_GENERIC_BUG
    /* Support for BUG */
    unsigned num_bugs;
    struct list_head bug_list;
    struct bug_entry *bug_table;
#endif

#ifdef CONFIG_KALLSYMS
    /* Protected by RCU and/or module_mutex: use rcu_dereference() */
    struct mod_kallsyms *kallsyms;
    struct mod_kallsyms core_kallsyms;

    /* Section attributes */
    struct module_sect_attrs *sect_attrs;

    /* Notes attributes */
    struct module_notes_attrs *notes_attrs;
#endif

    /* The command line arguments (may be mangled). People like
       keeping pointers to this stuff */
    char *args;

#ifdef CONFIG_SMP
    /* Per-cpu data. */
    void __percpu *percpu;
    unsigned int percpu_size;
#endif

#ifdef CONFIG_TRACEPOINTS
    unsigned int num_tracepoints;
    struct tracepoint * const *tracepoints_ptrs;
#endif
#ifdef HAVE_JUMP_LABEL
    struct jump_entry *jump_entries;
    unsigned int num_jump_entries;
#endif
#ifdef CONFIG_TRACING
    unsigned int num_trace_bprintk_fmt;
    const char **trace_bprintk_fmt_start;
#endif
#ifdef CONFIG_EVENT_TRACING
    struct trace_event_call **trace_events;
    unsigned int num_trace_events;
    struct trace_enum_map **trace_enums;
    unsigned int num_trace_enums;
#endif
#ifdef CONFIG_FTRACE_MCOUNT_RECORD
    unsigned int num_ftrace_callsites;
    unsigned long *ftrace_callsites;
#endif

#ifdef CONFIG_LIVEPATCH
    bool klp; /* Is this a livepatch module? */
    bool klp_alive;

    /* Elf information */
    struct klp_modinfo *klp_info;
#endif

#ifdef CONFIG_MODULE_UNLOAD
    /* What modules depend on me? */
    struct list_head source_list;
    /* What modules do I depend on? */
    struct list_head target_list;

    /* Destruction function. */
    void (*exit)(void);

    atomic_t refcnt;
#endif

#ifdef CONFIG_CONSTRUCTORS
    /* Constructor functions. */
    ctor_fn_t *ctors;
    unsigned int num_ctors;
#endif
} ____cacheline_aligned;

enum module_state {
    MODULE_STATE_LIVE,      /* Normal state. */
    MODULE_STATE_COMING,    /* Full formed, running module_init. */
    MODULE_STATE_GOING,     /* Going away. */
    MODULE_STATE_UNFORMED,  /* Still setting it up. */
};
While loading or removing a module, it's important to know its
current state; for instance, we need not insert an existing module if its
state specifies that it is already present.
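To make the role of the state concrete, here is a small user-space sketch of a load-time decision based on the state of an already known module with the same name. It is an illustration only: the enum merely mirrors the order of enum module_state above, and the function name, the use of a NULL pointer for "no such module", and the -EAGAIN result for a load that is still in progress are assumptions of this sketch rather than the kernel's own policy.

```c
#include <errno.h>
#include <stddef.h>

/* Local mirror of the states shown above; values belong to this sketch. */
enum sketch_mstate {
    SKETCH_MSTATE_LIVE,
    SKETCH_MSTATE_COMING,
    SKETCH_MSTATE_GOING,
    SKETCH_MSTATE_UNFORMED,
};

/*
 * Decide whether a module may be inserted, given the state of an existing
 * module with the same name (NULL when no such module is known).
 *
 * Returns 0 to proceed, -EEXIST when the module is already present, and
 * -EAGAIN (a choice made for this sketch) while another load of the same
 * module is still being set up.
 */
int sketch_may_insert(const enum sketch_mstate *existing)
{
    if (existing == NULL)
        return 0;

    switch (*existing) {
    case SKETCH_MSTATE_COMING:
    case SKETCH_MSTATE_UNFORMED:
        return -EAGAIN;   /* a load is in progress; outcome not yet known */
    default:
        return -EEXIST;   /* live (or going): treated here as present */
    }
}
```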
syms, crcs, and num_syms: These are used to manage the symbols that
are exported by the module code.
init: This is the pointer to a function which is called when the
module is initialized.
arch: This represents the architecture-specific structure, which is
populated with the architecture-specific data needed for the module to
run. However, this structure mostly remains empty, as most
architectures do not need any additional information.
taints: This is set if the module taints the kernel. It could mean
that the kernel suspects the module of doing something harmful, or that
the module contains non-GPL-compliant code.
percpu: This points to per-CPU data belonging to the module. It is
initialized at module load time.
source_list and target_list: These carry details of module
dependencies.
exit: This is simply the opposite of init. It points to the function that
is called to perform the cleanup process of the module. It releases
memory held by the module and performs other cleanup-specific tasks.
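Several of these members can be observed from user space through /proc/modules, where each line carries the module name, its size, a reference count, the modules that use it, and its state. The sketch below parses one such line; the struct and function names are hypothetical, the buffer sizes are arbitrary choices, and the mapping of columns to members (name, refcnt, source_list, state) is an interpretation offered for illustration.

```c
#include <stdio.h>

/* Hypothetical user-space view of one /proc/modules line. */
struct sketch_modinfo {
    char name[64];          /* module name */
    unsigned long size;     /* memory size in bytes */
    int refcnt;             /* reference count column */
    char used_by[256];      /* "a,b," list of users, or "-" if none */
    char state[16];         /* "Live", "Loading", or "Unloading" */
};

/*
 * Parse the first five whitespace-separated columns of a line.
 * Returns 0 on success and -1 if the line does not match, which also
 * covers kernels that print "-" instead of a numeric reference count.
 */
int sketch_parse_modules_line(const char *line, struct sketch_modinfo *out)
{
    int n = sscanf(line, "%63s %lu %d %255s %15s",
                   out->name, &out->size, &out->refcnt,
                   out->used_by, out->state);
    return n == 5 ? 0 : -1;
}

/*
 * Read lines from 'in' and print a short summary of each module to 'out'.
 * Lines that do not parse (including the tail of a line longer than the
 * local buffer) are skipped. Returns the number of modules printed.
 */
int sketch_dump_modules(FILE *in, FILE *out)
{
    char line[512];
    struct sketch_modinfo mi;
    int count = 0;

    while (fgets(line, sizeof(line), in)) {
        if (sketch_parse_modules_line(line, &mi) != 0)
            continue;
        fprintf(out, "%-24s state=%-9s refcnt=%d used_by=%s\n",
                mi.name, mi.state, mi.refcnt, mi.used_by);
        count++;
    }
    return count;
}
```

To try it, open /proc/modules with fopen() and pass the stream to sketch_dump_modules() along with stdout.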
struct module_layout {
    /* The actual code + data. */
    void *base;
    /* Total size. */
    unsigned int size;
    /* The size of the executable code. */
    unsigned int text_size;
    /* Size of RO section of the module (text+rodata) */
    unsigned int ro_size;

#ifdef CONFIG_MODULES_TREE_LOOKUP
    struct mod_tree_node mtn;
#endif
};
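The size fields of module_layout can be read as nested boundaries measured from base. The following user-space sketch classifies an address against such a layout. The struct and enum are hypothetical mirrors of the fields above, and the ordering it relies on (executable text first, then read-only data up to ro_size, then writable data up to size) is an assumption inferred from the field comments.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical mirror of the layout fields shown above. */
struct sketch_layout {
    void *base;              /* start of the region */
    unsigned int size;       /* total size */
    unsigned int text_size;  /* size of the executable code */
    unsigned int ro_size;    /* text + read-only data */
};

enum sketch_region {
    SKETCH_OUTSIDE,   /* address is not inside the layout */
    SKETCH_TEXT,      /* offset in [0, text_size) */
    SKETCH_RODATA,    /* offset in [text_size, ro_size) */
    SKETCH_DATA,      /* offset in [ro_size, size) */
};

/* Classify 'addr' by its offset from the start of the layout. */
enum sketch_region sketch_classify(const struct sketch_layout *l,
                                   const void *addr)
{
    uintptr_t start = (uintptr_t)l->base;
    uintptr_t a = (uintptr_t)addr;

    if (a < start || a - start >= l->size)
        return SKETCH_OUTSIDE;
    if (a - start < l->text_size)
        return SKETCH_TEXT;
    if (a - start < l->ro_size)
        return SKETCH_RODATA;
    return SKETCH_DATA;
}
```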
Summary
In this chapter, we briefly covered the core elements of modules, their
implications, and the details of their management. Our aim has been to give
you a quick yet comprehensive view of how the kernel achieves extensibility
through modules. You also looked at the core data structures that facilitate
module management. The kernel's effort to remain safe and stable in this
dynamic environment is also a notable feature.
I really hope this book serves as a means for you to go out there and experiment
more with the Linux kernel!