Thread

Re: Remove invalid SS2/SS3 handling from EUC-KR routines

Henson Choi <assam258@gmail.com> — 2026-05-14T01:56:23Z
Hi Michael, SungJun,

One framing point first.

Correctness of the rewritten routines is the easy part -- for valid
EUC-KR data there is no behavior change.  What deserves scrutiny is
the side-effect surface: callers that have been getting mblen=2 or 3
for 0x8E/0x8F, and the fact that such bytes can already be on disk
via routes that skip pg_verify_mbstr.  After the patch those rows
are measured differently by length/substring/LIKE.

Suggested review principles:

  (P1) Indirect-entry completeness -- pg_wchar_table[] should be
       confirmed as the only callback registry into the EUC-KR slots.
  (P2) Direct-call review -- callers of pg_mblen / pg_dsplen /
       pg_encoding_mblen / pg_verify_mbstr / pg_mb*cliplen /
       pg_encoding_max_length and their cstring variants should be
       enumerated and each call site checked for any reliance on
       the old mblen=2/3 return for 0x8E/0x8F.
  (P3) Verify-bypass:
       (a) call-time -- code that runs mblen/dsplen on bytes not
           verified in the same call chain (copyto.c file_encoding
           scan, cstring pg_mblen in formatting.c / varlena.c) should
           be identified.
       (b) load-time -- byte-preserving carry-over that skips
           pg_verify_mbstr (pg_upgrade relfilenode preservation,
           physical replication / WAL replay, pre-patch on-disk data
           from earlier versions) should be reasoned about.
  (P4) Cross-encoding contamination -- it should be confirmed that no
       Japanese-shaped assumption (SS2/SS3, JIS X 0201/0212,
       pg_eucjp_increment, shared pg_euc_* helpers) can reach EUC-KR
       data after the split.
  (P5) User-visible width -- pg_encoding_max_length('EUC_KR') changes
       from 3 to 2; buffer-sizing / width-bounded consumers should be
       checked.

Review-coverage list (not a test matrix):

  Encoding-layer surface (P5)
  - pg_wchar_table[PG_EUC_KR].maxmblen and readers of
    pg_encoding_max_length() that switch on EUC-KR width
  - pg_database_encoding_max_length() callers in mbutils.c

  Direct callers of pg_mblen / pg_dsplen (P2)
  - varlena.c:    text_substring, text_position, text_left/right,
                  text_reverse, length-family
  - oracle_compat.c: lpad/rpad/ltrim/rtrim/btrim/translate
  - formatting.c: to_char width handling (pg_mblen_cstr)
  - varchar.c:    bpchar / varchar length enforcement
                  (pg_mbcharcliplen)
  - xml.c, jsonfuncs.c: pg_mblen on textual content

  Pattern matching (P2)
  - like.c: multibyte LIKE

  Call-time verify-bypass (P3a)
  - copyto.c: file_encoding delimiter scan
  - formatting.c / varlena.c: cstring pg_mblen callers

  Load-time verify-bypass (P3b) -- highest follow-up priority
  - pg_upgrade: relfilenode preservation (bytes as-is)
  - Physical replication / WAL replay
  - Pre-patch on-disk data from earlier versions

  Pre-existing-data behavior (P3b consequence)
  - length/char_length/octet_length, substring/left/right/position,
    LIKE / regex on rows containing 0x8E/0x8F
  - Sort order / btree text indexes built before the patch
    (bytewise comparator, likely safe but worth naming)

  tsearch (P2)
  - spell.c, ts_locale.c, wparser_def.c, dict_thesaurus.c

  Conversion proc sanity
  - utf8_and_euc_kr (table-driven, expected unchanged)

These are the items I think are worth a line or two of reasoning
from someone who has read the code, though I don't think they
need to block this patch -- treating the review-coverage list as
background and following up separately is a perfectly fine
outcome.  I plan to work through them myself gradually, as time
allows, and share what I find along the way; other eyes would
be welcome.

----------------------------------------------------------------------

On 2026-05-13, Michael Paquier wrote:

> I unfortunately cannot read the PNG file [...]

I read KS X 2901 clause 4.2 directly in Korean; SungJun's later
transcription matches the normative text verbatim.

(On src/test/locale/: agreed with SungJun -- a separate concern
from this patch.)

> - the proposed patch has zero regression tests.

Agreed on the coverage gap.  SungJun's proposed 0001 baseline
tests (pg_encoding_max_length, G0/ASCII handling, G1/KS X 1001
handling, SS2 rejection) are a sensible starting point and worth
adopting as-is.  Beyond that, the review-coverage list above is
what I plan to walk as follow-up: some items will naturally
suggest further regression tests, while others (notably the
load-time bypass paths in P3b) can only be reasoned about in
review.  Any additional tests can come out of that pass.

----------------------------------------------------------------------

On 2026-05-13, SungJun Jang wrote (v2):

> v2 attached. [commit message and release note updated]

Commit-message update LGTM.  On the release-19.sgml hunk: release
notes are curated by the release manager from commit messages
after the fact, not written into the patch.  It would probably
be cleaner for v3 to omit that hunk and let the commit message
carry the user-visible change instead (pg_encoding_max_length
3 -> 2, plus the pre-existing 0x8E/0x8F behavior shift via the
P3b routes).  The charset.sgml Table 23.3 fix should stay --
that is documented behavior, not a release note.

----------------------------------------------------------------------

On 2026-05-13, SungJun Jang wrote (reply to Michael):

> [KS X 2901 clause text quoted in Korean and English]

Confirmed -- both match standard.go.kr.

> I am planning to split into two patches:
>   0001 -- Baseline capture [...]
>   0002 -- Actual fix [...]

The split is a nice shape for review -- characterization-first
makes the behavior delta easy to see, and "exactly one line moves"
is a useful audit signal.  Whether to keep the split or squash at
commit time is a committer call.

Regards,
Henson Choi