CPU Idle Time Management - The Linux Kernel Documentation (2023)

Copyright © 2018 Intel Corporation


Rafael J. Wysocki <rafael.j.wysocki@intel.com>


Modern processors are generally able to enter states in which the execution of a program is suspended and instructions belonging to it are not fetched from memory or executed. Those states are the idle states of the processor.

Since some of the processor hardware is not being used when the processor is idle, going into the idle state typically reduces the processor's power consumption, so this is an opportunity for power conservation.

CPU idle time management is a power saving feature that uses the processor's idle states for this purpose.

Logical CPUs

CPU idle time management operates on CPUs as seen by the CPU scheduler (the part of the kernel responsible for distributing computational work in the system). In its view, CPUs are logical units. That is, they need not be separate physical entities and may just be interfaces that appear to software as individual single-core processors. In other words, a CPU is an entity that appears to fetch instructions belonging to one sequence (program) from memory and to execute them, but it need not work this way physically. Generally, three different cases can be considered here.

First, if the whole processor can only follow one sequence of instructions (one program) at a time, it is a CPU. In that case, if the hardware is asked to enter an idle state, the request applies to the processor as a whole.

Second, if the processor is multi-core, each core in it is able to follow at least one program at a time. The cores need not be entirely independent of each other (for example, they may share caches), but most of the time they still work physically in parallel, so if each of them executes only one program, those programs run mostly independently of each other at the same time. The entire cores are CPUs in that case. If the hardware is asked to enter an idle state, the request applies to the core that asked for it in the first place, but it may also apply to a larger unit (say a "package" or a "cluster") that the core belongs to (in fact, it may apply to an entire hierarchy of larger units containing the core). Namely, if all of the cores in the larger unit except for one have been put into idle states at the "core level" and the remaining core asks the processor to enter an idle state, that may trigger the whole larger unit to be put into an idle state, which also affects the other cores in that unit.

Finally, each core in a multi-core processor may be able to follow more than one program in the same time frame (that is, each core may be able to fetch instructions from multiple locations in memory and execute them in the same time frame, but not necessarily entirely in parallel with each other). In that case the cores present themselves to software as "bundles", each consisting of multiple individual single-core "processors", referred to as hardware threads (or hyper-threads specifically on Intel hardware), each of which can follow one sequence of instructions. The hardware threads are then the CPUs from the CPU idle time management perspective. If one of them asks the processor to enter an idle state, the hardware thread (or CPU) that asked for it is stopped, but nothing more happens unless all of the other hardware threads within the same core have also asked the processor to enter an idle state. In that situation, the core may be put into an idle state individually, or a larger unit containing it may be put into an idle state as a whole (if the other cores within the larger unit are in idle states already).

Idle CPUs

Logical CPUs, simply referred to as "CPUs" in what follows, are regarded as idle by the Linux kernel when there are no tasks to run on them except for the special "idle" task.

Tasks are the CPU scheduler's representation of work. Each task consists of a sequence of instructions to execute (its code), the data to be manipulated while running that code, and some context information that needs to be loaded into the processor every time the task's code is run by a CPU. The CPU scheduler distributes work by assigning tasks to the CPUs present in the system.

Tasks can be in various states. In particular, they are runnable if there are no specific conditions preventing their code from being run by a CPU as long as a CPU is available for that (for example, they are not waiting for any events to occur). When a task becomes runnable, the CPU scheduler assigns it to one of the available CPUs. If no other runnable tasks are assigned to that CPU, it loads the given task's context and runs its code (from the instruction following the last one executed so far, possibly by another CPU). [If there are multiple runnable tasks assigned to one CPU simultaneously, they are subject to prioritization and time sharing so that all of them can make some progress over time.]

The special "idle" task becomes runnable if there are no other runnable tasks assigned to the given CPU, and the CPU is then regarded as idle. In other words, in Linux idle CPUs run the code of the "idle" task, called the idle loop. That code may cause the processor to be put into one of its idle states, if they are supported, in order to save energy. However, if the processor does not support any idle states, or there is not enough time to spend in an idle state before the next wakeup event, or there are strict latency constraints preventing any of the available idle states from being used, the CPU simply executes more or less useless instructions in a loop until it is assigned a new task to run.

The Idle Loop

The idle loop code takes two major steps in every iteration. First, it calls into a code module referred to as the governor, which belongs to the CPU idle time management subsystem called CPUIdle, to select an idle state for the CPU to ask the hardware to enter. Second, it invokes another code module from the CPUIdle subsystem, called the driver, to actually ask the processor hardware to enter the idle state selected by the governor.
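The two steps of one iteration can be sketched as follows. This is only an illustrative model in Python, not kernel code: the `IdleState` class, the `idle_loop_iteration` function and the two callbacks are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class IdleState:
    """Abstract idle state description (hypothetical model)."""
    name: str
    target_residency_us: int
    exit_latency_us: int


def idle_loop_iteration(states: List[IdleState],
                        select: Callable[[List[IdleState]], int],
                        enter: Callable[[IdleState], None]) -> int:
    """Run one iteration: the governor selects, then the driver enters."""
    index = select(states)   # step 1: governor picks an idle state
    enter(states[index])     # step 2: driver asks the hardware to enter it
    return index
```

Keeping the selection policy (`select`) apart from the hardware access (`enter`) mirrors the split between governor and driver described above.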

The role of the governor is to find the idle state most suitable for the conditions at hand. For this purpose, the idle states that logical CPUs can ask the hardware to enter are represented in an abstract way, independent of the platform or the processor architecture, and organized in a one-dimensional (linear) array. That array has to be prepared and supplied at initialization time by the CPUIdle driver matching the platform the kernel is running on. This allows CPUIdle governors to be independent of the underlying hardware and to work with any platform that the Linux kernel can run on.

Each idle state present in that array is characterized by two parameters to be taken into account by the governor: the target residency and the (worst-case) exit latency. The target residency is the minimum time the hardware must spend in the given state, including the time needed to enter it (which may be substantial), in order to save more energy than it would save by entering one of the shallower idle states instead. [The "depth" of an idle state roughly corresponds to the power drawn by the processor in that state.] The exit latency, in turn, is the maximum time it takes a CPU that asked the processor hardware to enter an idle state to start executing the first instruction after a wakeup from that state. Note that in general the exit latency must also cover the time needed to enter the given state, in case the wakeup occurs while the hardware is entering it and the state must be entered completely before it can be exited in an orderly manner.

There are two types of information that can influence the governor's decisions. First of all, the governor knows the time until the closest timer event. That time is known exactly, because the kernel programs timers and knows exactly when they will trigger. It is the maximum time the hardware that the given CPU depends on can spend in an idle state, including the time necessary to enter and exit it. However, the CPU may be woken up by a non-timer event at any time (in particular, before the closest timer triggers), and it is generally not known when that may happen. The governor can only see how long the CPU was actually idle after it has been woken up (that time is referred to as the idle duration from now on), and it can use that information along with the time until the closest timer to estimate future idle durations. How the governor uses that information depends on the algorithm it implements, and that is the primary reason for having more than one governor in the CPUIdle subsystem.

There are four CPUIdle governors available: menu, TEO, ladder and haltpoll. Which of them is used by default depends on the configuration of the kernel, in particular on whether or not the scheduler tick can be stopped by the idle loop. The available governors can be read from the available_governors file, and the governor can be changed at runtime. The name of the CPUIdle governor currently used by the kernel can be read from the current_governor_ro or current_governor file under /sys/devices/system/cpu/cpuidle/ in sysfs.
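As a rough illustration, the governor files mentioned above can be read with a few lines of Python. The helper name `read_governors` is made up for this sketch, and it assumes the files live under /sys/devices/system/cpu/cpuidle/ as described; files that a given kernel does not expose are simply skipped.

```python
from pathlib import Path

CPUIDLE_DIR = Path("/sys/devices/system/cpu/cpuidle")


def read_governors(base: Path = CPUIDLE_DIR) -> dict:
    """Return the governor-related sysfs values that are present."""
    info = {}
    for attr in ("available_governors", "current_governor",
                 "current_governor_ro"):
        path = base / attr
        if path.is_file():
            info[attr] = path.read_text().strip()
    # available_governors holds a whitespace-separated list of names.
    if "available_governors" in info:
        info["available_governors"] = info["available_governors"].split()
    return info


if __name__ == "__main__":
    print(read_governors())
```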

Which CPUIdle driver is used, on the other hand, usually depends on the platform the kernel is running on, but there are platforms with more than one matching driver. For example, there are two drivers that can work with the majority of Intel platforms, intel_idle and acpi_idle: the former has hard-coded idle state information and the latter reads that information from the system's ACPI tables. Even in those cases, the driver chosen at system initialization time cannot be replaced later, so the decision on which one to use has to be made early (on Intel platforms the acpi_idle driver is used if intel_idle is disabled for some reason or if it does not recognize the processor). The name of the CPUIdle driver currently used by the kernel can be read from the current_driver file under /sys/devices/system/cpu/cpuidle/ in sysfs.

Idle CPUs and the Scheduler Tick

The scheduler tick is a timer that triggers periodically in order to implement the time-sharing strategy of the CPU scheduler. If there are multiple runnable tasks assigned to one CPU at the same time, the only way to allow them to make reasonable progress in a given time frame is to make them share the available CPU time. Namely, as a rough approximation, each task is given a slice of the CPU time to run its code, subject to the scheduling class, prioritization and so on, and when that time slice is used up, the CPU should be switched over to running (the code of) another task. The currently running task may not want to give the CPU away voluntarily, however, and the scheduler tick is there to make the switch happen regardless. That is not the only role of the tick, but it is the primary reason for using it.

The scheduler tick is problematic from the CPU idle time management perspective, because it triggers periodically and relatively often (depending on the kernel configuration, the length of the tick period is between 1 ms and 10 ms). Thus, if the tick is allowed to trigger on idle CPUs, it does not make sense for them to ask the hardware to enter idle states with target residencies above the tick period length. Moreover, in that case the idle duration of any CPU never exceeds the tick period length, and the energy used for entering and exiting idle states due to the tick wakeups on idle CPUs is wasted.

Fortunately, it is not really necessary to allow the tick to trigger on idle CPUs, because (by definition) they have no tasks to run except for the special "idle" one. In other words, from the CPU scheduler's perspective, the only user of CPU time on them is the idle loop. Since the time of an idle CPU need not be shared between multiple runnable tasks, the primary reason for using the tick goes away if the given CPU is idle. Consequently, it is possible in principle to stop the scheduler tick entirely on idle CPUs, even though that may not always be worth the effort.

Whether or not it makes sense to stop the scheduler tick in the idle loop depends on what the governor expects. First, if there is another (non-tick) timer due to trigger within the tick range, stopping the tick clearly would be a waste of time, even though the timer hardware may not need to be reprogrammed in that case. Second, if the governor is expecting a non-timer wakeup within the tick range, stopping the tick is not necessary and may even be harmful. Namely, in that case the governor selects an idle state with a target residency within the time until the expected wakeup, so that state is going to be relatively shallow. The governor really cannot select a deep idle state then, as that would contradict its own expectation of a wakeup in short order. Now, if the wakeup really occurs shortly, stopping the tick would be a waste of time, and in this case the timer hardware would need to be reprogrammed, which is expensive. On the other hand, if the tick is stopped and the wakeup does not occur any time soon, the hardware may spend an indefinite amount of time in the shallow idle state selected by the governor, which is a waste of energy. Hence, if the governor is expecting a wakeup of any kind within the tick range, it is better to allow the tick to trigger. Otherwise, the governor selects a relatively deep idle state, so the tick should be stopped so that it does not wake up the CPU too early.

In any case, the governor knows what it is expecting, and the decision on whether or not to stop the scheduler tick belongs to it. Still, if the tick has been stopped already (in one of the previous iterations of the loop), it is better to leave it as is, and the governor needs to take that into account.
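The reasoning above reduces to a small decision rule. The sketch below is a simplified model under the assumption that the governor's expectation can be expressed as a single "expected wakeup" time; the function name and parameters are hypothetical.

```python
def should_stop_tick(expected_wakeup_us: int, tick_period_us: int,
                     tick_already_stopped: bool) -> bool:
    """Return True if the tick should be stopped (or stay stopped)."""
    if tick_already_stopped:
        # A tick stopped in an earlier iteration is better left as is.
        return True
    # A wakeup of any kind expected within the tick range: keep the tick.
    # Otherwise a deep state is likely, so the tick should not cut it short.
    return expected_wakeup_us >= tick_period_us
```

For example, with a 4000 us tick period, an expected wakeup in 500 us keeps the tick running, while one in 20000 us stops it.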

The kernel can be configured to disable stopping the scheduler tick in the idle loop altogether. That can be done through its build-time configuration (by unsetting the CONFIG_NO_HZ_IDLE configuration option) or by passing nohz=off to it on the command line. In both cases, as stopping the scheduler tick is disabled, the idle loop code simply ignores the governor's decisions regarding it and the tick is never stopped.

Systems that run kernels configured to allow the scheduler tick to be stopped on idle CPUs are referred to as tickless systems, and they are generally regarded as more energy-efficient than systems running kernels in which the tick cannot be stopped. If the given system is tickless, it uses the menu governor by default; if it is not tickless, the default CPUIdle governor on it is ladder.


The menu Governor

The menu governor is the default CPUIdle governor for tickless systems. It is quite complex, but the basic principle of its design is straightforward. Namely, when invoked to select an idle state for a CPU (that is, an idle state that the CPU will ask the processor hardware to enter), it attempts to predict the idle duration and uses the predicted value for idle state selection.

It first obtains the time until the closest timer event, on the assumption that the scheduler tick will be stopped. That time, referred to as the sleep length in what follows, is the upper bound on the time before the next CPU wakeup. It is used to determine the sleep length range, which in turn is needed to get the sleep length correction factor.

The menu governor maintains two arrays of sleep length correction factors. One of them is used when tasks previously running on the given CPU are waiting for some I/O operations to complete, and the other one is used when that is not the case. Each array contains several correction factor values that correspond to different sleep length ranges, organized so that each range represented in the array is approximately 10 times wider than the previous one.

The correction factor for the given sleep length range (determined before selecting the idle state for the CPU) is updated after the CPU has been woken up. The closer the sleep length is to the observed idle duration, the closer to 1 the correction factor becomes (it must fall between 0 and 1 inclusive). The sleep length is multiplied by the correction factor for the range that it falls into to obtain the first approximation of the predicted idle duration.
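A toy model of the two arrays and of the first approximation might look like the sketch below. The range boundaries, the moving-average update rule and all names are assumptions made for illustration; the text only states that each range is about 10 times wider than the previous one and that the factor stays between 0 and 1.

```python
from typing import Dict, List

# Assumed range boundaries in microseconds, each range 10 times wider
# than the previous one.
RANGE_LIMITS_US = [10, 100, 1_000, 10_000, 100_000]


def range_index(sleep_length_us: int) -> int:
    """Return the index of the sleep length range the value falls into."""
    return sum(1 for limit in RANGE_LIMITS_US if sleep_length_us >= limit)


def new_factors() -> Dict[bool, List[float]]:
    """Two arrays of factors, keyed by whether tasks are waiting for I/O."""
    size = len(RANGE_LIMITS_US) + 1
    return {False: [1.0] * size, True: [1.0] * size}


def first_approximation(factors, sleep_length_us: int,
                        io_waiting: bool) -> float:
    """Sleep length scaled by the correction factor of its range."""
    return sleep_length_us * factors[io_waiting][range_index(sleep_length_us)]


def update_factor(factors, sleep_length_us: int, idle_duration_us: int,
                  io_waiting: bool, weight: float = 0.125) -> None:
    """Move the factor toward observed/expected (hypothetical rule)."""
    if sleep_length_us <= 0:
        return
    ratio = min(idle_duration_us / sleep_length_us, 1.0)
    slot = factors[io_waiting]
    index = range_index(sleep_length_us)
    slot[index] += weight * (ratio - slot[index])
```

With this rule, a CPU that is repeatedly woken after half of the sleep length sees the factor for that range drift from 1.0 toward 0.5, which shortens later predictions for the same range.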

Next, the governor uses a simple pattern recognition algorithm to refine its idle duration prediction. Namely, it saves the last 8 observed idle duration values and, when predicting the idle duration next time, computes their average and variance. If the variance is small (smaller than 400 square milliseconds) or small relative to the average (the average is greater than 6 times the standard deviation), the average is regarded as the "typical interval" value. Otherwise, the longest of the saved observed idle duration values is discarded and the computation is repeated for the remaining ones. Again, if their variance is small (in the above sense), the average is taken as the "typical interval" value, and so on, until either the "typical interval" is determined or too many data points have been disregarded, in which case the "typical interval" is assumed to equal "infinity" (the maximum unsigned integer value). The "typical interval" computed this way is compared with the sleep length multiplied by the correction factor, and the minimum of the two is taken as the predicted idle duration.
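The pattern recognition step can be sketched as follows. The thresholds come from the description above; the point at which "too many" values have been discarded is not stated there, so `min_points` is an assumption, as are the function name and the 32-bit stand-in for "infinity".

```python
import math
from typing import List

INFINITY = 2**32 - 1  # stand-in for "the maximum unsigned integer value"


def typical_interval(samples: List[int], small_variance: float = 400,
                     min_points: int = 6) -> float:
    """Return the typical interval, or INFINITY if none can be found."""
    values = sorted(samples)
    while len(values) >= min_points:
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        # Small in absolute terms, or average above 6 standard deviations.
        if variance < small_variance or mean > 6 * math.sqrt(variance):
            return mean
        values.pop()  # discard the longest remaining value and retry
    return INFINITY


def predicted_idle(samples: List[int], corrected_sleep_length: float) -> float:
    """Minimum of the typical interval and the corrected sleep length."""
    return min(typical_interval(samples), corrected_sleep_length)
```

A single outlier among otherwise identical samples is dropped on the first retry, after which the variance is zero and the average of the remaining samples is returned.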

Then, the governor computes an extra latency limit to help "interactive" workloads. It uses the observation that if the exit latency of the selected idle state is comparable with the predicted idle duration, the total time spent in that state will probably be very short and the amount of energy saved by entering it will be relatively small, so it is likely better to avoid the overhead of entering and exiting that state. Selecting a shallower state is then likely to be the better option. The first approximation of the extra latency limit is the predicted idle duration itself, which is additionally divided by a value depending on the number of tasks that previously ran on the given CPU and are now waiting for I/O operations to complete. The result of that division is compared with the latency limit coming from the power management quality of service, or PM QoS, framework, and the minimum of the two is taken as the limit for the idle states' exit latency.
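In code form, the limit is a division followed by a minimum. The divisor used below (1 plus 10 per task waiting for I/O) is only an assumed example of "a value depending on the number of tasks"; the function and parameter names are hypothetical.

```python
def exit_latency_limit_us(predicted_idle_us: float, nr_iowaiters: int,
                          pm_qos_limit_us: float) -> float:
    """Return the exit latency limit used for idle state selection."""
    # Assumed divisor: grows with the number of tasks waiting for I/O.
    divisor = 1 + 10 * nr_iowaiters
    return min(predicted_idle_us / divisor, pm_qos_limit_us)
```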

Now, the governor is ready to walk the list of idle states and choose one of them. For this purpose, it compares the target residency of each state with the predicted idle duration and its exit latency with the computed latency limit. It selects the state with the target residency closest to the predicted idle duration, but still below it, and with an exit latency that does not exceed the limit.
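A minimal sketch of that walk is shown below. It assumes the states are ordered from shallowest to deepest with non-decreasing target residency and exit latency; the `IdleState` class, the state names and the numbers in the usage note are made up.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IdleState:
    name: str
    target_residency_us: int
    exit_latency_us: int


def select_state(states: List[IdleState], predicted_idle_us: float,
                 latency_limit_us: float) -> int:
    """Return the index of the deepest state meeting both constraints."""
    chosen = 0  # fall back to the shallowest state
    for index, state in enumerate(states):
        if state.target_residency_us > predicted_idle_us:
            break  # this state and all deeper ones need a longer idle period
        if state.exit_latency_us > latency_limit_us:
            break  # this state and all deeper ones take too long to exit
        chosen = index
    return chosen
```

With states POLL (0, 0), C1 (2, 2) and C6 (600, 130), given as (target residency, exit latency) in microseconds, a predicted idle duration of 500 us selects C1, because C6 would need at least 600 us to pay off.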

In the final step, the governor may still need to refine the idle state selection if it has not decided to stop the scheduler tick. That happens if the idle duration it predicted is less than the tick period and the tick has not been stopped already (in a previous iteration of the idle loop). In that case, the sleep length used in the previous computations may not reflect the real time until the closest timer event, and if it really is greater than that time, the governor may need to select a shallower state with a suitable target residency.

The Timer Events Oriented (TEO) Governor

The timer events oriented (TEO) governor is an alternative CPUIdle governor for tickless systems. It follows the same basic strategy as the menu governor: it always tries to find the deepest idle state suitable for the given conditions. However, it applies a different approach to that problem.

The idea of this governor is based on the observation that on many systems timer events are two or more orders of magnitude more frequent than any other interrupts, so they are likely to be the most significant cause of CPU wakeups from idle states. Moreover, information about what happened in the (relatively recent) past can be used to estimate whether or not the deepest idle state with a target residency within the (known) time until the closest timer event, referred to as the sleep length, is likely to be suitable for the upcoming CPU idle period and, if not, which of the shallower idle states to choose instead.

Of course, non-timer wakeup sources are more important in some use cases, which can be covered by taking a few of the most recent idle time intervals of the CPU into account. However, even in that case it is not necessary to consider idle duration values greater than the sleep length, because the closest timer will ultimately wake up the CPU anyway unless it is woken up earlier.

Consequently, this governor estimates whether or not the CPU idle time is likely to be significantly shorter than the sleep length and selects an idle state for it accordingly.

The computations carried out by this governor are based on bins whose boundaries are aligned with the target residency parameter values of the CPU idle states provided by the CPUIdle driver, in ascending order. That is, the first bin spans from 0 up to, but not including, the target residency of the second idle state (idle state 1), the second bin spans from the target residency of idle state 1 up to, but not including, the target residency of idle state 2, the third bin spans from the target residency of idle state 2 up to, but not including, the target residency of idle state 3, and so on. The last bin spans from the target residency of the deepest idle state supplied by the driver to infinity.
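Mapping a duration to its bin is a binary search over the target residencies. The following sketch assumes a list of target residencies sorted in ascending order, one per idle state; the function name and the values in the usage note are hypothetical.

```python
from bisect import bisect_right
from typing import List


def bin_index(duration_us: int, target_residencies_us: List[int]) -> int:
    """Return the index of the bin that the given duration falls into."""
    # Bin 0 starts at 0 whatever the target residency of idle state 0 is;
    # bin i (i >= 1) starts at the target residency of idle state i.
    lower_bounds = target_residencies_us[1:]
    return bisect_right(lower_bounds, duration_us)
```

For target residencies of 0, 2, 20 and 600 us, a duration of 19 us falls into bin 1, a duration of exactly 20 us falls into bin 2, and anything from 600 us upward falls into the last bin.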

Two metrics called "hits" and "intercepts" are associated with each bin. They are updated every time before an idle state is selected for the given CPU, in accordance with what happened last time.

The "hits" metric reflects the relative frequency of situations in which the sleep length and the idle duration measured after CPU wakeup fall into the same bin (that is, the CPU appears to wake up "on time" relative to the sleep length). In turn, the "intercepts" metric reflects the relative frequency of situations in which the measured idle duration is so much shorter than the sleep length that the bin it falls into corresponds to an idle state shallower than the one whose bin the sleep length falls into (these situations are referred to as "intercepts" below).

In addition to the metrics described above, the governor counts recent intercepts (that is, intercepts that have occurred during the last NR_RECENT invocations of it for the given CPU) for each bin.

In order to select an idle state for a CPU, the governor takes the following steps (modulo the possible latency constraint, which must be taken into account too):

  1. Find the deepest CPU idle state whose target residency does not exceed the current sleep length (the candidate idle state) and compute three sums as follows:

    • The sum of the "hits" and "intercepts" metrics for the candidate state and all of the deeper idle states (it represents the cases in which the CPU was idle long enough to avoid being intercepted if the sleep length had been equal to the current one).

    • The sum of the "intercepts" metrics for all of the idle states shallower than the candidate one (it represents the cases in which the CPU was not idle long enough to avoid being intercepted if the sleep length had been equal to the current one).

    • The sum of the numbers of recent intercepts for all of the idle states shallower than the candidate one.

  2. If the second sum is greater than the first one, or the third sum is greater than NR_RECENT / 2, the CPU is likely to wake up early, so look for an alternative idle state to select.

    • Traverse the idle states shallower than the candidate one in descending order.

    • For each of them, compute the sum of the "intercepts" metrics and the sum of the numbers of recent intercepts over all of the idle states between it and the candidate one (including the former and excluding the latter).

    • If each of these sums that needs to be taken into account (because the check related to it has indicated that the CPU is likely to wake up early) is greater than half of the corresponding sum computed in step 1 (which means that the target residency of the state in question had not exceeded the idle duration in over half of the relevant cases), select the given idle state instead of the candidate one.

  3. By default, select the candidate state.
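The steps above can be sketched in Python as follows. This is a simplified model, not the kernel implementation: latency constraints are ignored, the `Bin` class and `teo_select` function are hypothetical, and the value of `NR_RECENT` is assumed for illustration.

```python
from dataclasses import dataclass
from typing import List

NR_RECENT = 9  # assumed value, for illustration only


@dataclass
class Bin:
    hits: int = 0
    intercepts: int = 0
    recent: int = 0  # number of recent intercepts


def teo_select(residencies_us: List[int], bins: List[Bin],
               sleep_length_us: int) -> int:
    """Return the index of the idle state to select."""
    # Step 1: deepest state whose target residency fits the sleep length.
    candidate = 0
    for i, residency in enumerate(residencies_us):
        if residency <= sleep_length_us:
            candidate = i
    deep_sum = sum(b.hits + b.intercepts for b in bins[candidate:])
    shallow_intercepts = sum(b.intercepts for b in bins[:candidate])
    shallow_recent = sum(b.recent for b in bins[:candidate])

    # Step 2: is the CPU likely to wake up early?
    check_metrics = shallow_intercepts > deep_sum
    check_recent = shallow_recent > NR_RECENT / 2
    if check_metrics or check_recent:
        intercepts = recent = 0
        for i in range(candidate - 1, -1, -1):  # descending order
            intercepts += bins[i].intercepts
            recent += bins[i].recent
            metrics_ok = (not check_metrics
                          or 2 * intercepts > shallow_intercepts)
            recent_ok = not check_recent or 2 * recent > shallow_recent
            if metrics_ok and recent_ok:
                return i

    # Step 3: select the candidate state by default.
    return candidate
```

Only the checks that fired in step 2 are applied while traversing the shallower states, which matches the "each of these sums that needs to be taken into account" wording.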


Util-awareness mechanism

The idea behind the util-awareness extension is that there are two distinct scenarios for the CPU, utilized and not utilized, which should result in two different approaches to idle state selection.

In this case, "utilized" means that the average runqueue utilization of the CPU is above a certain threshold.

When the CPU is utilized while going into idle, it will more likely than not be woken up soon to do more work, so a shallower idle state should be selected to minimize latency and maximize performance. When the CPU is not being utilized, the usual metrics-based approach to selecting the deepest available idle state should be preferred to take advantage of the power saving.

In order to achieve this, the governor uses a utilization threshold. The threshold is computed per CPU as a percentage of the CPU's capacity by bit-shifting the capacity value. Based on testing, a shift of 6 (~1.56%) seems to give the best results.

Before selecting the next idle state, the governor compares the current CPU utilization with the precomputed utilization threshold. If it is below the threshold, the governor defaults to the TEO metrics mechanism. If it is above, the closest shallower idle state is selected instead, as long as it is not a polling state.
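The threshold arithmetic and the resulting adjustment can be illustrated as below. The function names, the capacity value of 1024 used in the usage note and the set of polling state indices are assumptions of this sketch; the text does not say what happens when utilization equals the threshold, so that case is treated here as not utilized.

```python
from typing import Set

UTIL_THRESHOLD_SHIFT = 6  # 1 / 2**6 = 1.5625 % of the CPU capacity


def util_threshold(cpu_capacity: int) -> int:
    """Utilization threshold obtained by bit-shifting the capacity."""
    return cpu_capacity >> UTIL_THRESHOLD_SHIFT


def adjust_for_utilization(candidate: int, cpu_util: int, cpu_capacity: int,
                           polling_states: Set[int]) -> int:
    """Pick the closest shallower non-polling state on a utilized CPU."""
    if cpu_util <= util_threshold(cpu_capacity):
        return candidate  # not utilized: keep the metrics-based choice
    shallower = candidate - 1
    if shallower >= 0 and shallower not in polling_states:
        return shallower
    return candidate
```

For a capacity of 1024 the threshold is 1024 >> 6 = 16, so a CPU with a utilization of 100 counts as utilized and a candidate of state 2 is replaced by state 1.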

Representation of Idle States

For CPU idle time management purposes, all of the physical idle states supported by the processor have to be represented as a one-dimensional array of struct cpuidle_state objects, each allowing an individual (logical) CPU to ask the processor hardware to enter an idle state with certain properties. If there is a hierarchy of units in the processor, one struct cpuidle_state object can cover a combination of idle states supported by the units at different levels of the hierarchy. In that case, its target residency and exit latency parameters must reflect the properties of the idle state at the deepest level (that is, the idle state of the unit containing all of the other units).

For example, take a processor with two cores in a larger unit referred to as a "module", and suppose that asking the hardware to enter a specific idle state (say "X") at the "core" level by one core triggers the module to try to enter a specific idle state of its own (say "MX") if the other core is in idle state "X" already. In other words, asking for idle state "X" at the "core" level gives the hardware a license to go as deep as idle state "MX" at the "module" level, but there is no guarantee that this is going to happen (the core asking for idle state "X" may just end up in that state by itself instead). Then, the target residency of the struct cpuidle_state object representing idle state "X" must reflect the minimum time to spend in idle state "MX" of the module (including the time needed to enter it), because that is the minimum time the CPU needs to be idle to save any energy in case the hardware enters that state. Analogously, the exit latency parameter of that object must cover the exit time of idle state "MX" of the module (and usually its entry time too), because that is the maximum delay between a wakeup signal and the time the CPU starts to execute the first new instruction (assuming that both cores in the module are always ready to execute instructions as soon as the module becomes operational as a whole).

There are processors without direct coordination between different levels of the hierarchy of units inside them, however. In those cases, asking for an idle state at the "core" level does not automatically affect the "module" level in any way, and the CPUIdle driver is responsible for handling the entire hierarchy. The definition of the idle state objects is then entirely up to the driver, but the physical properties of the idle state that the processor hardware finally goes into must still follow the parameters used by the governor for idle state selection (for instance, the actual exit latency of that idle state must not exceed the exit latency parameter of the idle state object selected by the governor).

In addition to the target residency and exit latency parameters discussed above, the objects representing idle states each contain a few other parameters describing the idle state and a pointer to the function to run in order to ask the hardware to enter that state. Also, for each struct cpuidle_state object, there is a corresponding struct cpuidle_state_usage object containing usage statistics of the given idle state. That information is exposed by the kernel via sysfs.

For each CPU in the system, there is a /sys/devices/system/cpu/cpu<N>/cpuidle/ directory in sysfs, where the number <N> is assigned to the given CPU at initialization time. That directory contains a set of subdirectories called state0, state1 and so on, up to the number of idle state objects defined for the given CPU minus one. Each of these directories corresponds to one idle state object, and the larger the number in its name, the deeper the (effective) idle state it represents. Each of them contains a number of files (attributes) representing the properties of the corresponding idle state object, as follows:

above

Total number of times this idle state had been asked for, but the observed idle duration was certainly too short to match its target residency.

below

Total number of times this idle state had been asked for, but certainly a deeper idle state would have been a better match for the observed idle duration.

desc

Description of the idle state.

disable

Whether or not this idle state is disabled.

default_status

The default status of this state, "enabled" or "disabled".

latency

Exit latency of the idle state in microseconds.

name

Name of the idle state.

power

Power drawn by the hardware in this idle state in milliwatts (if specified, 0 otherwise).

residency

Target residency of the idle state in microseconds.

time

Total time spent in this idle state by the given CPU (as measured by the kernel) in microseconds.

usage

Total number of times the hardware has been asked by the given CPU to enter this idle state.

rejected

Total number of times a request to enter this idle state on the given CPU was rejected.

The desc and name files both contain strings. The difference between them is that the name is expected to be more concise, while the description may be longer and may contain white space or special characters. The other files listed above contain integer numbers.
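As an illustration, the per-state attribute files can be collected into dictionaries with a short Python helper. The function name `read_idle_states` is hypothetical; the sketch assumes the cpu<N>/cpuidle/state<M>/ layout under /sys/devices/system/cpu/ and falls back to strings for any value that does not parse as an integer.

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def read_idle_states(cpu: int, root: Path = CPU_ROOT) -> list:
    """Return one dict of attribute values per idle state of the CPU."""
    cpuidle_dir = root / f"cpu{cpu}" / "cpuidle"
    # Sort numerically so that state10 comes after state9.
    state_dirs = sorted(cpuidle_dir.glob("state[0-9]*"),
                        key=lambda p: int(p.name[len("state"):]))
    states = []
    for state_dir in state_dirs:
        attrs = {}
        for attr in sorted(state_dir.iterdir()):
            if not attr.is_file():
                continue  # skip any nested directories
            text = attr.read_text().strip()
            try:
                attrs[attr.name] = int(text)
            except ValueError:
                attrs[attr.name] = text  # string-valued attribute
        states.append(attrs)
    return states


if __name__ == "__main__":
    for index, state in enumerate(read_idle_states(0)):
        print(index, state.get("name"), state.get("residency"))
```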

The disable attribute is the only writable one. If it contains 1, the given idle state is disabled for this particular CPU, which means that the governor will never select it for this particular CPU and, as a result, the CPUIdle driver will never ask the hardware to enter it for that CPU. However, disabling an idle state for one CPU does not prevent it from being asked for by the other CPUs, so it must be disabled for all of them in order never to be asked for by any of them. [Note that, due to the way the ladder governor is implemented, disabling an idle state prevents that governor from selecting any idle states deeper than the disabled one too.]

If the disable attribute contains 0, the given idle state is enabled for this particular CPU, but it may still be disabled for some or all of the other CPUs in the system at the same time. Writing 1 to it causes the idle state to be disabled for this particular CPU, and writing 0 to it allows the governor to take it into consideration for the given CPU and the driver to ask for it, unless that state was disabled globally in the driver (in which case it cannot be used at all).

The power attribute is not defined very well, especially for idle state objects representing combinations of idle states at different levels of the hierarchy of units in the processor, and it is generally hard to obtain idle state power numbers for complex hardware. As a result, power often contains 0 (not available). If it contains a nonzero number, that number may not be very accurate and should not be relied on for anything meaningful.

The number in the time file may generally be greater than the total time really spent by the given CPU in the given idle state, because it is measured by the kernel and may not cover the cases in which the hardware refused to enter this idle state and entered a shallower one instead (or did not enter any idle state at all). The kernel can only measure the time span between asking the hardware to enter an idle state and the subsequent wakeup of the CPU, and it cannot say what really happened in the meantime at the hardware level. Moreover, if the idle state object in question represents a combination of idle states at different levels of the hierarchy of units in the processor, the kernel can never say how deep down the hierarchy the hardware went in any particular case. For these reasons, the only reliable way to find out how much time the hardware has spent in the different idle states it supports is to use idle state residency counters in the hardware, if available.

Generally, an interrupt received while trying to enter an idle state causes the idle state entry request to be rejected, in which case the CPUIdle driver may return an error code to indicate that this was the case. The usage and rejected files report the number of times the given idle state was entered successfully or rejected, respectively.

CPU power management quality of service

The power management quality of service (PM QoS) framework in the Linux kernel allows kernel code and user space processes to set constraints on various energy-efficiency features of the kernel to prevent performance from dropping below a required level.

CPU idle time management can be affected by PM QoS in two ways: through the global CPU latency limit and through the resume latency constraints for individual CPUs. Kernel code (e.g. device drivers) can set both of them with the help of special internal interfaces provided by the PM QoS framework. User space can modify the former by opening the cpu_dma_latency special device file under /dev/ and writing a binary value (interpreted as a signed 32-bit integer) to it. In turn, the resume latency constraint for a CPU can be modified from user space by writing a string (representing a signed 32-bit integer) to the power/pm_qos_resume_latency_us file under /sys/devices/system/cpu/cpu<N>/ in sysfs, where the CPU number <N> is allocated at system initialization time. Negative values are rejected in both cases and, also in both cases, the written integer is interpreted as a requested PM QoS constraint in microseconds.

However, the requested value is not automatically applied as a new constraint, because it may be less restrictive (greater in this particular case) than another constraint previously requested by someone else. For this reason, the PM QoS framework maintains a list of the requests made so far for the global CPU latency limit and for each individual CPU, aggregates them and applies the effective (minimum in this particular case) value as the new constraint.
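The aggregation rule can be modeled in a few lines. The class below is a toy model written for this explanation, not the kernel's priority list implementation; all names in it are hypothetical, and None stands in for "no request present".

```python
class LatencyRequests:
    """Toy model of a list of latency requests aggregated by minimum."""

    def __init__(self):
        self._requests = {}   # handle -> requested limit in microseconds
        self._next_handle = 0

    def add(self, limit_us):
        """Register a new request and return a handle identifying it."""
        handle = self._next_handle
        self._next_handle += 1
        self._requests[handle] = limit_us
        return handle

    def update(self, handle, limit_us):
        """Change the value of an existing request."""
        self._requests[handle] = limit_us

    def remove(self, handle):
        """Drop a request, as happens when its owner goes away."""
        del self._requests[handle]

    def effective(self):
        """Return the most restrictive (smallest) requested limit, if any."""
        return min(self._requests.values(), default=None)
```

With requests of 100 and 20 microseconds in the list, the effective limit is 20. Adding a request of 50 changes nothing, while removing the request of 20 relaxes the effective limit to 50.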

In fact, opening the cpu_dma_latency special device file causes a new PM QoS request to be created and added to a global priority list of CPU latency limit requests, and the file descriptor returned by the "open" operation represents that request. If that file descriptor is then used for writing, the number written to it is associated with the PM QoS request it represents as the new requested limit value. Next, the priority list mechanism is used to determine the new effective value of the entire list of requests, and that effective value is set as the new CPU latency limit. Thus requesting a new limit value changes the real limit only if the effective "list" value is affected by it, which is the case if it is the minimum of the requested values in the list.

The process holding a file descriptor obtained by opening the cpu_dma_latency special device file controls the PM QoS request associated with that file descriptor, but it controls this particular PM QoS request only.

Closing the cpu_dma_latency special device file or, more precisely, the file descriptor obtained while opening it, causes the PM QoS request associated with that file descriptor to be removed from the global priority list of CPU latency limit requests and destroyed. If that happens, the priority list mechanism is used again to determine the new effective value for the whole list, and that value becomes the new limit.
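The open, write and close life cycle described above maps naturally onto a context manager. The following is a minimal sketch, assuming the device node is /dev/cpu_dma_latency and that the caller has permission to open it; the function name is hypothetical and the range check mirrors the statement that negative values are rejected.

```python
import os
import struct
from contextlib import contextmanager


@contextmanager
def cpu_latency_limit(limit_us, device="/dev/cpu_dma_latency"):
    """Hold a global CPU latency request for the duration of a with block."""
    if not 0 <= limit_us <= 2**31 - 1:
        raise ValueError("limit must fit in a non-negative signed 32-bit integer")
    # Opening the file creates the request.
    fd = os.open(device, os.O_WRONLY)
    try:
        # Write the limit as a binary signed 32-bit integer in native byte order.
        os.write(fd, struct.pack("i", limit_us))
        yield fd
    finally:
        # Closing the descriptor removes the request again.
        os.close(fd)
```

Code run inside "with cpu_latency_limit(20):" keeps its request alive until the block exits. Whether the effective limit actually becomes 20 microseconds depends on the other requests in the list at that time.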

In turn, for each CPU there is one resume latency PM QoS request associated with the power/pm_qos_resume_latency_us file under /sys/devices/system/cpu/cpu<N>/ in sysfs, and writing to it causes this single PM QoS request to be updated regardless of which user space process does that. In other words, this PM QoS request is shared by the entire user space, so access to the file associated with it needs to be arbitrated to avoid confusion. [Arguably, the only legitimate use of this mechanism in practice is to pin a process to the CPU in question and let it use the sysfs interface to control the resume latency constraint for it.] It is still only a request, however. It is an entry in a priority list used to determine the effective value to be set as the resume latency constraint for the CPU in question every time the list of requests is updated in one way or another (there may be other requests coming from kernel code in that list).
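The usage pattern mentioned in the bracketed note (pin a process to a CPU, then control that CPU's resume latency request) could look roughly like the sketch below. The helper names are hypothetical, os.sched_setaffinity is available on Linux only, and writing the sysfs file normally requires root privileges.

```python
import os
from pathlib import Path


def resume_latency_path(cpu, root="/sys/devices/system/cpu"):
    """Build the path of the per-CPU resume latency file."""
    return Path(root) / f"cpu{cpu}" / "power" / "pm_qos_resume_latency_us"


def set_resume_latency(cpu, limit_us, root="/sys/devices/system/cpu"):
    """Write the requested limit (microseconds) as a decimal string."""
    if limit_us < 0:
        raise ValueError("negative values are rejected")
    resume_latency_path(cpu, root).write_text(str(limit_us))


def pin_to_cpu(cpu):
    """Restrict the calling process to a single CPU (Linux only)."""
    os.sched_setaffinity(0, {cpu})
```

A latency-sensitive process might call pin_to_cpu(2) followed by set_resume_latency(2, 50). Because the request is shared by all of user space, any other process writing to the same file would overwrite this value.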

CPU idle time governors are expected to regard the minimum of the global (effective) CPU latency limit and the effective resume latency constraint for the given CPU as the upper limit for the exit latency of the idle states that they are allowed to select for that CPU. They should never select any idle state with exit latency beyond that limit.
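The rule above can be expressed as a small filter. This sketch only illustrates the latency constraint; real governors also weigh other factors that are not modeled here. The state names and exit latency numbers in the usage note are made up for the example.

```python
def allowed_exit_latency(global_limit_us, cpu_resume_limit_us):
    """Upper bound on exit latency: the stricter of the two effective limits."""
    return min(global_limit_us, cpu_resume_limit_us)


def deepest_allowed_state(states, limit_us):
    """Pick the deepest state whose exit latency does not exceed the limit.

    states is a list of (name, exit_latency_us) pairs ordered from the
    shallowest to the deepest state. Returns None if no state qualifies.
    """
    chosen = None
    for name, exit_latency_us in states:
        if exit_latency_us <= limit_us:
            chosen = name
    return chosen
```

Given the hypothetical states [("POLL", 0), ("C1", 2), ("C6", 85)], a global limit of 50 and a per-CPU limit of 100 yield an upper bound of 50, so C6 is excluded and C1 is the deepest state that may be selected.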

Idle States Control Via Kernel Command Line

In addition to the sysfs interface allowing individual idle states to be disabled for individual CPUs, there are kernel command line parameters affecting CPU idle time management.

The cpuidle.off=1 kernel command line option can be used to disable CPU idle time management entirely. It does not prevent the idle loop from running on idle CPUs, but it prevents the CPU idle time governors and drivers from being invoked. If it is added to the kernel command line, the idle loop asks the hardware to enter idle states on idle CPUs via the CPU architecture support code, which is expected to provide a default mechanism for this purpose. However, that default mechanism is usually the least common denominator for all of the processors implementing the architecture (i.e. CPU instruction set) in question, so it is rather crude and not very energy-efficient. For this reason, it is not recommended for production use.

The cpuidle.governor= kernel command line switch allows the CPUIdle governor to use to be specified. It has to be appended with a string matching the name of an available governor (e.g. cpuidle.governor=menu), and that governor will be used instead of the default one. For example, this makes it possible to force the menu governor to be used on systems that use the ladder governor by default.

The other kernel command line parameters controlling CPU idle time management described below are only relevant for the x86 architecture, and references to intel_idle affect Intel processors only.

The x86 architecture support code recognizes three kernel command line options related to CPU idle time management: idle=poll, idle=halt, and idle=nomwait. The first two of them disable the acpi_idle and intel_idle drivers altogether, which effectively causes the entire CPUIdle subsystem to be disabled and makes the idle loop invoke the architecture support code to deal with idle CPUs. How it does that depends on which of the two parameters is added to the kernel command line. In the idle=halt case, the architecture support code uses the HLT instruction of the CPUs for this purpose (as a rule, that instruction suspends the execution of the program and causes the hardware to attempt to enter the shallowest available idle state). If idle=poll is used, idle CPUs execute a more or less "lightweight" sequence of instructions in a tight loop. [Note that using idle=poll is somewhat drastic in many cases, as preventing idle CPUs from saving almost any energy at all may not be its only effect. For example, on Intel hardware it effectively prevents CPUs from using P-states (see CPU Performance Scaling) that require any number of CPUs in a package to be idle, so it may well hurt single-thread computation performance as well as energy efficiency. Thus using it for performance reasons may not be a good idea at all.]

The idle=nomwait option prevents the use of the MWAIT instruction of the CPU to enter idle states. When this option is used, the acpi_idle driver uses the HLT instruction instead of MWAIT. On systems running Intel processors, this option disables the intel_idle driver and forces the use of the acpi_idle driver instead. Note that in either case the acpi_idle driver will function only if all the information it needs is in the system's ACPI tables.

In addition to the architecture-level kernel command line options affecting CPU idle time management, there are parameters affecting individual CPUIdle drivers that can be passed to them via the kernel command line. Specifically, the intel_idle.max_cstate=<n> and processor.max_cstate=<n> parameters, where <n> is an idle state index also used in the name of the given state's directory in sysfs (see Representation of Idle States), cause the intel_idle and acpi_idle drivers, respectively, to discard all of the idle states deeper than idle state <n>. In that case, they will never ask for any of those idle states or expose them to the governor. [The behavior of the two drivers is different for <n> equal to 0. Adding intel_idle.max_cstate=0 to the kernel command line disables the intel_idle driver and allows acpi_idle to be used, whereas processor.max_cstate=0 is equivalent to processor.max_cstate=1. Also, the acpi_idle driver is part of the processor kernel module, which can be loaded separately, and max_cstate=<n> can be passed to it as a module parameter when it is loaded.]
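To check which of the options discussed in this section were passed to a running kernel, the command line can be inspected from user space. The sketch below is a simple illustration with a hypothetical function name; it only recognizes the parameters named in this section and does a plain whitespace split, so quoted values containing spaces are not handled.

```python
# Parameters discussed in this section.
CPUIDLE_PARAMS = (
    "cpuidle.off",
    "cpuidle.governor",
    "idle",
    "intel_idle.max_cstate",
    "processor.max_cstate",
)


def cpuidle_cmdline_params(cmdline):
    """Extract the idle-related key=value options from a kernel command line."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in CPUIDLE_PARAMS:
            found[key] = value
    return found
```

On a Linux system the input would typically be the contents of /proc/cmdline, for example cpuidle_cmdline_params(open("/proc/cmdline").read()).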
