v3-0001-added-option-to-select-number-of-direct-values-to.patch
text/x-patch
Filename: v3-0001-added-option-to-select-number-of-direct-values-to.patch
Type: text/x-patch
Part: 0
From 56cf14f282a99e5431daf05f97de454f82358a9a Mon Sep 17 00:00:00 2001
From: Hannu Krosing <hannuk@google.com>
Date: Sun, 6 Jul 2025 20:17:24 +0200
Subject: [PATCH v3] added option to select number of direct values to show,
added documentation, moved old general discussion to PostgreSQL wiki
---
doc/src/sgml/ref/pgtesttiming.sgml | 247 ++++++------------------
src/bin/pg_test_timing/pg_test_timing.c | 128 +++++++++---
2 files changed, 162 insertions(+), 213 deletions(-)
diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml
index a5eb3aa25e0..4b03bf77487 100644
--- a/doc/src/sgml/ref/pgtesttiming.sgml
+++ b/doc/src/sgml/ref/pgtesttiming.sgml
@@ -32,6 +32,12 @@ PostgreSQL documentation
<para>
<application>pg_test_timing</application> is a tool to measure the timing overhead
on your system and confirm that the system time never moves backwards.
+ </para>
+ <para>
+ This utility is also used to determine if the <varname>track_io_timing</varname>
+ configuration parameter is safe to turn on.
+ </para>
+ <para>
Systems that are slow to collect timing data can give less accurate
<command>EXPLAIN ANALYZE</command> results.
</para>
@@ -59,6 +65,18 @@ PostgreSQL documentation
</listitem>
</varlistentry>
+ <varlistentry>
+ <term><option>-N <replaceable class="parameter">list_n_direct_values</replaceable></option></term>
+ <term><option>--list-n-direct-values=<replaceable class="parameter">list_n_direct_values</replaceable></option></term>
+ <listitem>
+ <para>
+ Specifies the number of direct ns values to list. We count first 1024 nanosecond values
+ to help in understanding the behaviour of the system clock but by default we show just
+ the first 10 that have non-zero count.
+ </para>
+ </listitem>
+ </varlistentry>
+
<varlistentry>
<term><option>-V</option></term>
<term><option>--version</option></term>
@@ -92,205 +110,65 @@ PostgreSQL documentation
<title>Interpreting Results</title>
<para>
- Good results will show most (>90%) individual timing calls take less than
- one microsecond. Average per loop overhead will be even lower, below 100
- nanoseconds. This example from an Intel i7-860 system using a TSC clock
- source shows excellent performance:
-
-<screen><![CDATA[
-Testing timing overhead for 3 seconds.
-Per loop time including overhead: 35.96 ns
-Histogram of timing durations:
- < us % of total count
- 1 96.40465 80435604
- 2 3.59518 2999652
- 4 0.00015 126
- 8 0.00002 13
- 16 0.00000 2
-]]></screen>
+ The output has four columns with rows showing a shifted-by-one log2(ns) histogram of timing calls.
+ This is not the classic log2(n+1) histogram as it counts zeros separately and then switches to
+ log2(ns) starting from value == 1.
</para>
-
<para>
- Note that different units are used for the per loop time than the
- histogram. The loop can have resolution within a few nanoseconds (ns),
- while the individual timing calls can only resolve down to one microsecond
- (us).
+ <itemizedlist spacing="compact">
+ <listitem><simpara>nanosecond value that is larger of equal of counted ns</simpara></listitem>
+ <listitem><simpara>percentage of values in this bucket</simpara></listitem>
+ <listitem><simpara>running percentage of values in this and previous buckets</simpara></listitem>
+ <listitem><simpara>count of timings in this bucket</simpara></listitem>
+ </itemizedlist>
</para>
-
- </refsect2>
- <refsect2>
- <title>Measuring Executor Timing Overhead</title>
-
<para>
- When the query executor is running a statement using
- <command>EXPLAIN ANALYZE</command>, individual operations are timed as well
- as showing a summary. The overhead of your system can be checked by
- counting rows with the <application>psql</application> program:
-
-<screen>
-CREATE TABLE t AS SELECT * FROM generate_series(1,100000);
-\timing
-SELECT COUNT(*) FROM t;
-EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
-</screen>
+ The results below show that 99% of timing calls took between 8 and 31 nanoseconds (the first block)
</para>
-
<para>
- The i7-860 system measured runs the count query in 9.8 ms while
- the <command>EXPLAIN ANALYZE</command> version takes 16.6 ms, each
- processing just over 100,000 rows. That 6.8 ms difference means the timing
- overhead per row is 68 ns, about twice what pg_test_timing estimated it
- would be. Even that relatively small amount of overhead is making the fully
- timed count statement take almost 70% longer. On more substantial queries,
- the timing overhead would be less problematic.
+ The second block goes into more detail showing that the fastest calls took 13 ns and 99% took 18 ns or less.
</para>
- </refsect2>
-
- <refsect2>
- <title>Changing Time Sources</title>
<para>
- On some newer Linux systems, it's possible to change the clock source used
- to collect timing data at any time. A second example shows the slowdown
- possible from switching to the slower acpi_pm time source, on the same
- system used for the fast results above:
-
<screen><![CDATA[
-# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-tsc hpet acpi_pm
-# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
-# pg_test_timing
-Per loop time including overhead: 722.92 ns
+Testing timing overhead for 3 seconds.
+Per loop time including overhead: 15.09 ns
Histogram of timing durations:
- < us % of total count
- 1 27.84870 1155682
- 2 72.05956 2990371
- 4 0.07810 3241
- 8 0.01357 563
- 16 0.00007 3
+ <= ns % of total running % count
+ 0 0.0000 0.0000 0
+ 1 0.0000 0.0000 0
+ 3 0.0000 0.0000 0
+ 7 0.0000 0.0000 0
+ 15 78.2383 78.2383 155509081
+ 31 21.6825 99.9208 43096793
+ 63 0.0755 99.9963 150061
+ 127 0.0010 99.9972 1926
+ 255 0.0011 99.9983 2097
+ 511 0.0012 99.9995 2470
+ 1023 0.0001 99.9997 248
+ 2047 0.0002 99.9999 403
+ 4095 0.0001 99.9999 142
+ 8191 0.0000 100.0000 55
+ 16383 0.0000 100.0000 45
+ 32767 0.0000 100.0000 10
+ 65535 0.0000 100.0000 6
+
+Timing durations less than 1024 ns, first 10 values:
+ ns % of total running % count
+ 13 0.0560 0.0560 111398
+ 14 21.9167 21.9727 43562342
+ 15 56.2656 78.2383 111835341
+ 16 19.0465 97.2848 37857493
+ 17 1.5008 98.7856 2983064
+ 18 0.5467 99.3324 1086682
+ 19 0.2145 99.5469 426359
+ 20 0.1422 99.6891 282614
+ 21 0.0553 99.7444 110002
+ 22 0.0252 99.7696 50099
]]></screen>
</para>
- <para>
- In this configuration, the sample <command>EXPLAIN ANALYZE</command> above
- takes 115.9 ms. That's 1061 ns of timing overhead, again a small multiple
- of what's measured directly by this utility. That much timing overhead
- means the actual query itself is only taking a tiny fraction of the
- accounted for time, most of it is being consumed in overhead instead. In
- this configuration, any <command>EXPLAIN ANALYZE</command> totals involving
- many timed operations would be inflated significantly by timing overhead.
- </para>
-
- <para>
- FreeBSD also allows changing the time source on the fly, and it logs
- information about the timer selected during boot:
-
-<screen>
-# dmesg | grep "Timecounter"
-Timecounter "ACPI-fast" frequency 3579545 Hz quality 900
-Timecounter "i8254" frequency 1193182 Hz quality 0
-Timecounters tick every 10.000 msec
-Timecounter "TSC" frequency 2531787134 Hz quality 800
-# sysctl kern.timecounter.hardware=TSC
-kern.timecounter.hardware: ACPI-fast -> TSC
-</screen>
- </para>
-
- <para>
- Other systems may only allow setting the time source on boot. On older
- Linux systems the "clock" kernel setting is the only way to make this sort
- of change. And even on some more recent ones, the only option you'll see
- for a clock source is "jiffies". Jiffies are the older Linux software clock
- implementation, which can have good resolution when it's backed by fast
- enough timing hardware, as in this example:
-
-<screen><![CDATA[
-$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-jiffies
-$ dmesg | grep time.c
-time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.
-time.c: Detected 2400.153 MHz processor.
-$ pg_test_timing
-Testing timing overhead for 3 seconds.
-Per timing duration including loop overhead: 97.75 ns
-Histogram of timing durations:
- < us % of total count
- 1 90.23734 27694571
- 2 9.75277 2993204
- 4 0.00981 3010
- 8 0.00007 22
- 16 0.00000 1
- 32 0.00000 1
-]]></screen></para>
-
</refsect2>
-
- <refsect2>
- <title>Clock Hardware and Timing Accuracy</title>
-
- <para>
- Collecting accurate timing information is normally done on computers using
- hardware clocks with various levels of accuracy. With some hardware the
- operating systems can pass the system clock time almost directly to
- programs. A system clock can also be derived from a chip that simply
- provides timing interrupts, periodic ticks at some known time interval. In
- either case, operating system kernels provide a clock source that hides
- these details. But the accuracy of that clock source and how quickly it can
- return results varies based on the underlying hardware.
- </para>
-
- <para>
- Inaccurate time keeping can result in system instability. Test any change
- to the clock source very carefully. Operating system defaults are sometimes
- made to favor reliability over best accuracy. And if you are using a virtual
- machine, look into the recommended time sources compatible with it. Virtual
- hardware faces additional difficulties when emulating timers, and there are
- often per operating system settings suggested by vendors.
- </para>
-
- <para>
- The Time Stamp Counter (TSC) clock source is the most accurate one available
- on current generation CPUs. It's the preferred way to track the system time
- when it's supported by the operating system and the TSC clock is
- reliable. There are several ways that TSC can fail to provide an accurate
- timing source, making it unreliable. Older systems can have a TSC clock that
- varies based on the CPU temperature, making it unusable for timing. Trying
- to use TSC on some older multicore CPUs can give a reported time that's
- inconsistent among multiple cores. This can result in the time going
- backwards, a problem this program checks for. And even the newest systems
- can fail to provide accurate TSC timing with very aggressive power saving
- configurations.
- </para>
-
- <para>
- Newer operating systems may check for the known TSC problems and switch to a
- slower, more stable clock source when they are seen. If your system
- supports TSC time but doesn't default to that, it may be disabled for a good
- reason. And some operating systems may not detect all the possible problems
- correctly, or will allow using TSC even in situations where it's known to be
- inaccurate.
- </para>
-
- <para>
- The High Precision Event Timer (HPET) is the preferred timer on systems
- where it's available and TSC is not accurate. The timer chip itself is
- programmable to allow up to 100 nanosecond resolution, but you may not see
- that much accuracy in your system clock.
- </para>
-
- <para>
- Advanced Configuration and Power Interface (ACPI) provides a Power
- Management (PM) Timer, which Linux refers to as the acpi_pm. The clock
- derived from acpi_pm will at best provide 300 nanosecond resolution.
- </para>
-
- <para>
- Timers used on older PC hardware include the 8254 Programmable Interval
- Timer (PIT), the real-time clock (RTC), the Advanced Programmable Interrupt
- Controller (APIC) timer, and the Cyclone timer. These timers aim for
- millisecond resolution.
- </para>
- </refsect2>
</refsect1>
<refsect1>
@@ -298,6 +176,7 @@ Histogram of timing durations:
<simplelist type="inline">
<member><xref linkend="sql-explain"/></member>
+ <member><ulink url="https://wiki.postgresql.org/wiki/Pg_test_timing">General discussion about timing moved into PostgreSQL Wiki</ulink></member>
</simplelist>
</refsect1>
</refentry>
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index ce7aad4b25a..270ab41ea92 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -9,19 +9,25 @@
#include <limits.h>
#include "getopt_long.h"
+#include "port/pg_bitutils.h"
#include "portability/instr_time.h"
static const char *progname;
static unsigned int test_duration = 3;
+static unsigned int list_n_direct_values = 10;
+
+/* record duration in powers of 2 nanoseconds */
+static long long int histogram[32];
+
+/* record counts of first 128 durations directly */
+#define NUM_DIRECT 1024
+static long long int direct_histogram[NUM_DIRECT];
static void handle_args(int argc, char *argv[]);
static uint64 test_timing(unsigned int duration);
static void output(uint64 loop_count);
-/* record duration in powers of 2 microseconds */
-static long long int histogram[32];
-
int
main(int argc, char *argv[])
{
@@ -44,6 +50,7 @@ handle_args(int argc, char *argv[])
{
static struct option long_options[] = {
{"duration", required_argument, NULL, 'd'},
+ {"list-n-direct-values", required_argument, NULL, 'N'},
{NULL, 0, NULL, 0}
};
@@ -56,7 +63,7 @@ handle_args(int argc, char *argv[])
{
if (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-?") == 0)
{
- printf(_("Usage: %s [-d DURATION]\n"), progname);
+ printf(_("Usage: %s [-d DURATION] [-N <number of direct values>]\n"), progname);
exit(0);
}
if (strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-V") == 0)
@@ -66,7 +73,7 @@ handle_args(int argc, char *argv[])
}
}
- while ((option = getopt_long(argc, argv, "d:",
+ while ((option = getopt_long(argc, argv, "d:N:",
long_options, &optindex)) != -1)
{
switch (option)
@@ -93,6 +100,28 @@ handle_args(int argc, char *argv[])
}
break;
+ case 'N':
+ errno = 0;
+ optval = strtoul(optarg, &endptr, 10);
+
+ if (endptr == optarg || *endptr != '\0' ||
+ errno != 0 || optval != (unsigned int) optval)
+ {
+ fprintf(stderr, _("%s: invalid argument for option %s\n"),
+ progname, "--list-n-direct-values");
+ fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
+ exit(1);
+ }
+
+ list_n_direct_values = (unsigned int) optval;
+ if (list_n_direct_values > NUM_DIRECT)
+ {
+ fprintf(stderr, _("%s: %s must be in range %u..%u\n"),
+ progname, "--list_n_direct_values", 0, NUM_DIRECT);
+ exit(1);
+ }
+ break;
+
default:
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
progname);
@@ -111,7 +140,6 @@ handle_args(int argc, char *argv[])
exit(1);
}
-
printf(ngettext("Testing timing overhead for %u second.\n",
"Testing timing overhead for %u seconds.\n",
test_duration),
@@ -130,19 +158,19 @@ test_timing(unsigned int duration)
end_time,
temp;
- total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;
+ total_time = duration > 0 ? duration * INT64CONST(1000000000) : 0;
INSTR_TIME_SET_CURRENT(start_time);
- cur = INSTR_TIME_GET_MICROSEC(start_time);
+ cur = INSTR_TIME_GET_NANOSEC(start_time);
while (time_elapsed < total_time)
{
int32 diff,
- bits = 0;
+ bits;
prev = cur;
INSTR_TIME_SET_CURRENT(temp);
- cur = INSTR_TIME_GET_MICROSEC(temp);
+ cur = INSTR_TIME_GET_NANOSEC(temp);
diff = cur - prev;
/* Did time go backwards? */
@@ -154,18 +182,21 @@ test_timing(unsigned int duration)
}
/* What is the highest bit in the time diff? */
- while (diff)
- {
- diff >>= 1;
- bits++;
- }
+ if (diff > 0)
+ bits = pg_leftmost_one_pos32(diff) + 1;
+ else
+ bits = 0;
/* Update appropriate duration bucket */
histogram[bits]++;
+ /* Update direct histogram of time diffs */
+ if (diff < NUM_DIRECT)
+ direct_histogram[diff]++;
+
loop_count++;
INSTR_TIME_SUBTRACT(temp, start_time);
- time_elapsed = INSTR_TIME_GET_MICROSEC(temp);
+ time_elapsed = INSTR_TIME_GET_NANOSEC(temp);
}
INSTR_TIME_SET_CURRENT(end_time);
@@ -181,28 +212,67 @@ test_timing(unsigned int duration)
static void
output(uint64 loop_count)
{
- int64 max_bit = 31,
+ int max_bit = 31,
i;
- char *header1 = _("< us");
+ char *header1 = _("<= ns");
+ char *header1b = _("ns");
char *header2 = /* xgettext:no-c-format */ _("% of total");
- char *header3 = _("count");
+ char *header3 = /* xgettext:no-c-format */ _("running %");
+ char *header4 = _("count");
int len1 = strlen(header1);
int len2 = strlen(header2);
int len3 = strlen(header3);
+ int len4 = strlen(header4);
+ double rprct;
/* find highest bit value */
while (max_bit > 0 && histogram[max_bit] == 0)
max_bit--;
+ /* set minimum column widths */
+ len1 = Max(8, len1);
+ len2 = Max(10, len2);
+ len3 = Max(10, len3);
+ len4 = Max(10, len4);
+
printf(_("Histogram of timing durations:\n"));
- printf("%*s %*s %*s\n",
- Max(6, len1), header1,
- Max(10, len2), header2,
- Max(10, len3), header3);
-
- for (i = 0; i <= max_bit; i++)
- printf("%*ld %*.5f %*lld\n",
- Max(6, len1), 1l << i,
- Max(10, len2) - 1, (double) histogram[i] * 100 / loop_count,
- Max(10, len3), histogram[i]);
+ printf("%*s %*s %*s %*s\n",
+ len1, header1,
+ len2, header2,
+ len3, header3,
+ len4, header4);
+
+ for (i = 0, rprct = 0; i <= max_bit; i++)
+ {
+ double prct = (double) histogram[i] * 100 / loop_count;
+
+ rprct += prct;
+ printf("%*ld %*.4f %*.4f %*lld\n",
+ len1, (1L << i) - 1,
+ len2, prct,
+ len3, rprct,
+ len4, histogram[i]);
+ }
+
+ printf(_("\nTiming durations less than %d ns, first %d values:\n"), NUM_DIRECT, list_n_direct_values);
+ printf("%*s %*s %*s %*s\n",
+ len1, header1b,
+ len2, header2,
+ len3, header3,
+ len4, header4);
+
+ for (i = 0, rprct = 0; i < NUM_DIRECT && list_n_direct_values ; i++)
+ {
+ double prct = (double) direct_histogram[i] * 100 / loop_count;
+
+ rprct += prct;
+ if (direct_histogram[i]) {
+ list_n_direct_values--;
+ printf("%*d %*.4f %*.4f %*lld\n",
+ len1, i,
+ len2, prct,
+ len3, rprct,
+ len4, direct_histogram[i]);
+ }
+ }
}
--
2.50.0.727.gbf7dc18ff4-goog