diff mbox series

[bitbake-devel,V3] runqueue.py: fix PSI check logic

Message ID 20230803051916.1640841-1-Qi.Chen@windriver.com
State New
Headers show
Series [bitbake-devel,V3] runqueue.py: fix PSI check logic | expand

Commit Message

ChenQi Aug. 3, 2023, 5:19 a.m. UTC
From: Chen Qi <Qi.Chen@windriver.com>

The current logic is not correct because if the time interval
between the current check and the last check is very small, the PSI
checker is not likely to block things even if the system is heavy
loaded.

It's not good to calculate the value too often. So we change to a 1s
check. As a build will usually take at least minutes, using the 1s value
seems reasonable.

Signed-off-by: Chen Qi <Qi.Chen@windriver.com>
---
 bitbake/lib/bb/runqueue.py | 29 ++++++++++++++---------------
 1 file changed, 14 insertions(+), 15 deletions(-)

Comments

Richard Purdie Aug. 3, 2023, 6:19 a.m. UTC | #1
On Thu, 2023-08-03 at 13:19 +0800, Chen Qi via lists.openembedded.org
wrote:
> From: Chen Qi <Qi.Chen@windriver.com>
> 
> The current logic is not correct because if the time interval
> between the current check and the last check is very small, the PSI
> checker is not likely to block things even if the system is heavy
> loaded.
> 
> It's not good to calculate the value too often. So we change to a 1s
> check. As a build will usually take at least minutes, using the 1s value
> seems reasonable.
> 
> Signed-off-by: Chen Qi <Qi.Chen@windriver.com>
> ---
>  bitbake/lib/bb/runqueue.py | 29 ++++++++++++++---------------
>  1 file changed, 14 insertions(+), 15 deletions(-)

I think the challenge with the 1s difference is that it will cause two
different behaviours.

The first is that when it limits pressure, it will keep limiting for at
least one second afterwards. That might not be too bad, it would
potentially just delay a build slightly.

The second issue is that when bitbake starts executing tasks, even if
the pressure of the system spikes, it won't be seen for one second and
bitbake will likely have executed BB_NUMBER_THREAD processes in that
time, likely pushing the pressure on the system well above threshold.

I suspect we're going to have to come up with something different to
handle these issues...

Cheers,

Richard
ChenQi Aug. 3, 2023, 6:38 a.m. UTC | #2
Thanks Richard. You're right. I'll send out V4 to remove the 1s window, this will make bitbake do the check every time it tries to spawn a new task.

Regards,
Qi

-----Original Message-----
From: Richard Purdie <richard.purdie@linuxfoundation.org> 
Sent: Thursday, August 3, 2023 2:19 PM
To: Chen, Qi <Qi.Chen@windriver.com>; bitbake-devel@lists.openembedded.org
Cc: MacLeod, Randy <Randy.MacLeod@windriver.com>
Subject: Re: [bitbake-devel][PATCH V3] runqueue.py: fix PSI check logic

On Thu, 2023-08-03 at 13:19 +0800, Chen Qi via lists.openembedded.org
wrote:
> From: Chen Qi <Qi.Chen@windriver.com>
> 
> The current logic is not correct because if the time interval between 
> the current check and the last check is very small, the PSI checker is 
> not likely to block things even if the system is heavy loaded.
> 
> It's not good to calculate the value too often. So we change to a 1s 
> check. As a build will usually take at least minutes, using the 1s 
> value seems reasonable.
> 
> Signed-off-by: Chen Qi <Qi.Chen@windriver.com>
> ---
>  bitbake/lib/bb/runqueue.py | 29 ++++++++++++++---------------
>  1 file changed, 14 insertions(+), 15 deletions(-)

I think the challenge with the 1s difference is that it will cause two different behaviours.

The first is that when it limits pressure, it will keep limiting for at least one second afterwards. That might not be too bad, it would potentially just delay a build slightly.

The second issue is that when bitbake starts executing tasks, even if the pressure of the system spikes, it won't be seen for one second and bitbake will likely have executed BB_NUMBER_THREAD processes in that time, likely pushing the pressure on the system well above threshold.

I suspect we're going to have to come up with something different to handle these issues...

Cheers,

Richard
ChenQi Aug. 3, 2023, 7:56 a.m. UTC | #3
After more investigation, I found the problem is a little complicated. If the time interval is very small, the result might be 0, resulting in any small number having no effect. This means, even if we set BB_PRESSURE_MAX_IO/CPU/MEMORY all to "1", there will be a few tasks running together throughout the whole build process. In theory, there should only be 1 task running under such setting.

So I've sent out V4, which blocks task spawning for at least 1s while checking the pressure every time if the task spawning is not blocked. This would improve the situation.

Regards,
Qi

-----Original Message-----
From: bitbake-devel@lists.openembedded.org <bitbake-devel@lists.openembedded.org> On Behalf Of Chen Qi via lists.openembedded.org
Sent: Thursday, August 3, 2023 2:38 PM
To: Richard Purdie <richard.purdie@linuxfoundation.org>; bitbake-devel@lists.openembedded.org
Cc: MacLeod, Randy <Randy.MacLeod@windriver.com>
Subject: Re: [bitbake-devel][PATCH V3] runqueue.py: fix PSI check logic

Thanks Richard. You're right. I'll send out V4 to remove the 1s window, this will make bitbake do the check every time it tries to spawn a new task.

Regards,
Qi

-----Original Message-----
From: Richard Purdie <richard.purdie@linuxfoundation.org>
Sent: Thursday, August 3, 2023 2:19 PM
To: Chen, Qi <Qi.Chen@windriver.com>; bitbake-devel@lists.openembedded.org
Cc: MacLeod, Randy <Randy.MacLeod@windriver.com>
Subject: Re: [bitbake-devel][PATCH V3] runqueue.py: fix PSI check logic

On Thu, 2023-08-03 at 13:19 +0800, Chen Qi via lists.openembedded.org
wrote:
> From: Chen Qi <Qi.Chen@windriver.com>
> 
> The current logic is not correct because if the time interval between 
> the current check and the last check is very small, the PSI checker is 
> not likely to block things even if the system is heavy loaded.
> 
> It's not good to calculate the value too often. So we change to a 1s 
> check. As a build will usually take at least minutes, using the 1s 
> value seems reasonable.
> 
> Signed-off-by: Chen Qi <Qi.Chen@windriver.com>
> ---
>  bitbake/lib/bb/runqueue.py | 29 ++++++++++++++---------------
>  1 file changed, 14 insertions(+), 15 deletions(-)

I think the challenge with the 1s difference is that it will cause two different behaviours.

The first is that when it limits pressure, it will keep limiting for at least one second afterwards. That might not be too bad, it would potentially just delay a build slightly.

The second issue is that when bitbake starts executing tasks, even if the pressure of the system spikes, it won't be seen for one second and bitbake will likely have executed BB_NUMBER_THREAD processes in that time, likely pushing the pressure on the system well above threshold.

I suspect we're going to have to come up with something different to handle these issues...

Cheers,

Richard
diff mbox series

Patch

diff --git a/bitbake/lib/bb/runqueue.py b/bitbake/lib/bb/runqueue.py
index 48788f4aa6..769709f2d1 100644
--- a/bitbake/lib/bb/runqueue.py
+++ b/bitbake/lib/bb/runqueue.py
@@ -179,6 +179,7 @@  class RunQueueScheduler(object):
                     self.prev_memory_pressure = memory_pressure_fds.readline().split()[4].split("=")[1]
                     self.prev_pressure_time = time.time()
                 self.check_pressure = True
+                self.psi_exceeded = False
             except:
                 bb.note("The /proc/pressure files can't be read. Continuing build without monitoring pressure")
                 self.check_pressure = False
@@ -191,6 +192,10 @@  class RunQueueScheduler(object):
         BB_PRESSURE_MAX_{CPU|IO|MEMORY} are set, return True if above threshold.
         """
         if self.check_pressure:
+            now = time.time()
+            tdiff = now - self.prev_pressure_time
+            if tdiff < 1.0:
+                return self.psi_exceeded
             with open("/proc/pressure/cpu") as cpu_pressure_fds, \
                 open("/proc/pressure/io") as io_pressure_fds, \
                 open("/proc/pressure/memory") as memory_pressure_fds:
@@ -198,25 +203,19 @@  class RunQueueScheduler(object):
                 curr_cpu_pressure = cpu_pressure_fds.readline().split()[4].split("=")[1]
                 curr_io_pressure = io_pressure_fds.readline().split()[4].split("=")[1]
                 curr_memory_pressure = memory_pressure_fds.readline().split()[4].split("=")[1]
-                now = time.time()
-                tdiff = now - self.prev_pressure_time
-                if tdiff > 1.0:
-                    exceeds_cpu_pressure =  self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) / tdiff > self.rq.max_cpu_pressure
-                    exceeds_io_pressure =  self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) / tdiff > self.rq.max_io_pressure
-                    exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) / tdiff > self.rq.max_memory_pressure
-                    self.prev_cpu_pressure = curr_cpu_pressure
-                    self.prev_io_pressure = curr_io_pressure
-                    self.prev_memory_pressure = curr_memory_pressure
-                    self.prev_pressure_time = now
-                else:
-                    exceeds_cpu_pressure =  self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) > self.rq.max_cpu_pressure
-                    exceeds_io_pressure =  self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) > self.rq.max_io_pressure
-                    exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) > self.rq.max_memory_pressure
+                exceeds_cpu_pressure =  self.rq.max_cpu_pressure and (float(curr_cpu_pressure) - float(self.prev_cpu_pressure)) / tdiff > self.rq.max_cpu_pressure
+                exceeds_io_pressure =  self.rq.max_io_pressure and (float(curr_io_pressure) - float(self.prev_io_pressure)) / tdiff > self.rq.max_io_pressure
+                exceeds_memory_pressure = self.rq.max_memory_pressure and (float(curr_memory_pressure) - float(self.prev_memory_pressure)) / tdiff > self.rq.max_memory_pressure
+                self.prev_cpu_pressure = curr_cpu_pressure
+                self.prev_io_pressure = curr_io_pressure
+                self.prev_memory_pressure = curr_memory_pressure
+                self.prev_pressure_time = now
+                self.psi_exceeded = exceeds_cpu_pressure or exceeds_io_pressure or exceeds_memory_pressure
             pressure_state = (exceeds_cpu_pressure, exceeds_io_pressure, exceeds_memory_pressure)
             if hasattr(self, "pressure_state") and pressure_state != self.pressure_state:
                 bb.note("Pressure status changed to CPU: %s, IO: %s, Mem: %s" % pressure_state)
             self.pressure_state = pressure_state
-            return (exceeds_cpu_pressure or exceeds_io_pressure or exceeds_memory_pressure)
+            return self.psi_exceeded
         return False
 
     def next_buildable_task(self):