Thread

  1. Re: Remove invalid SS2/SS3 handling from EUC-KR routines

    Henson Choi <assam258@gmail.com> — 2026-05-14T01:56:23Z

    Hi Michael, SungJun,
    
    One framing point first.
    
    Correctness of the rewritten routines is the easy part -- for valid
    EUC-KR data there is no behavior change.  What deserves scrutiny is
    the side-effect surface: callers that have been getting mblen=2 or 3
    for 0x8E/0x8F, and the fact that such bytes can already be on disk
    via routes that skip pg_verify_mbstr.  After the patch those rows
    are measured differently by length/substring/LIKE.
    
    Suggested review principles:
    
      (P1) Indirect-entry completeness -- pg_wchar_table[] should be
           confirmed as the only callback registry into the EUC-KR slots.
      (P2) Direct-call review -- callers of pg_mblen / pg_dsplen /
           pg_encoding_mblen / pg_verify_mbstr / pg_mb*cliplen /
           pg_encoding_max_length and their cstring variants should be
           enumerated and each call site checked for any reliance on
           the old mblen=2/3 return for 0x8E/0x8F.
      (P3) Verify-bypass:
           (a) call-time -- code that runs mblen/dsplen on bytes not
               verified in the same call chain (copyto.c file_encoding
               scan, cstring pg_mblen in formatting.c / varlena.c) should
               be identified.
           (b) load-time -- byte-preserving carry-over that skips
               pg_verify_mbstr (pg_upgrade relfilenode preservation,
               physical replication / WAL replay, pre-patch on-disk data
               from earlier versions) should be reasoned about.
      (P4) Cross-encoding contamination -- it should be confirmed that no
           Japanese-shaped assumption (SS2/SS3, JIS X 0201/0212,
           pg_eucjp_increment, shared pg_euc_* helpers) can reach EUC-KR
           data after the split.
      (P5) User-visible width -- pg_encoding_max_length('EUC_KR') changes
           from 3 to 2; buffer-sizing / width-bounded consumers should be
           checked.
    
    Review-coverage list (not a test matrix):
    
      Encoding-layer surface (P5)
      - pg_wchar_table[PG_EUC_KR].maxmblen and readers of
        pg_encoding_max_length() that switch on EUC-KR width
      - pg_database_encoding_max_length() callers in mbutils.c
    
      Direct callers of pg_mblen / pg_dsplen (P2)
      - varlena.c:    text_substring, text_position, text_left/right,
                      text_reverse, length-family
      - oracle_compat.c: lpad/rpad/ltrim/rtrim/btrim/translate
      - formatting.c: to_char width handling (pg_mblen_cstr)
      - varchar.c:    bpchar / varchar length enforcement
                      (pg_mbcharcliplen)
      - xml.c, jsonfuncs.c: pg_mblen on textual content
    
      Pattern matching (P2)
      - like.c: multibyte LIKE
    
      Call-time verify-bypass (P3a)
      - copyto.c: file_encoding delimiter scan
      - formatting.c / varlena.c: cstring pg_mblen callers
    
      Load-time verify-bypass (P3b) -- highest follow-up priority
      - pg_upgrade: relfilenode preservation (bytes as-is)
      - Physical replication / WAL replay
      - Pre-patch on-disk data from earlier versions
    
      Pre-existing-data behavior (P3b consequence)
      - length/char_length/octet_length, substring/left/right/position,
        LIKE / regex on rows containing 0x8E/0x8F
      - Sort order / btree text indexes built before the patch
        (bytewise comparator, likely safe but worth naming)
    
      tsearch (P2)
      - spell.c, ts_locale.c, wparser_def.c, dict_thesaurus.c
    
      Conversion proc sanity
      - utf8_and_euc_kr (table-driven, expected unchanged)
    
    These are the items I think are worth a line or two of reasoning
    from someone who has read the code, though I don't think they
    need to block this patch -- treating the review-coverage list as
    background and following up separately is a perfectly fine
    outcome.  I plan to work through them myself gradually, as time
    allows, and share what I find along the way; other eyes would
    be welcome.
    
    ----------------------------------------------------------------------
    
    On 2026-05-13, Michael Paquier wrote:
    
    > I unfortunately cannot read the PNG file [...]
    
    I read KS X 2901 clause 4.2 directly in Korean; SungJun's later
    transcription matches the normative text verbatim.
    
    (On src/test/locale/: agreed with SungJun -- a separate concern
    from this patch.)
    
    > - the proposed patch has zero regression tests.
    
    Agreed on the coverage gap.  SungJun's proposed 0001 baseline
    tests (pg_encoding_max_length, G0/ASCII handling, G1/KS X 1001
    handling, SS2 rejection) are a sensible starting point and worth
    adopting as-is.  Beyond that, the review-coverage list above is
    what I plan to walk as follow-up: some items will naturally
    suggest further regression tests, while others (notably the
    load-time bypass paths in P3b) can only be reasoned about in
    review.  Any additional tests can come out of that pass.
    
    ----------------------------------------------------------------------
    
    On 2026-05-13, SungJun Jang wrote (v2):
    
    > v2 attached. [commit message and release note updated]
    
    Commit-message update LGTM.  On the release-19.sgml hunk: release
    notes are curated by the release manager from commit messages
    after the fact, not written into the patch.  It would probably
    be cleaner for v3 to omit that hunk and let the commit message
    carry the user-visible change instead (pg_encoding_max_length
    3 -> 2, plus the pre-existing 0x8E/0x8F behavior shift via the
    P3b routes).  The charset.sgml Table 23.3 fix should stay --
    that is documented behavior, not a release note.
    
    ----------------------------------------------------------------------
    
    On 2026-05-13, SungJun Jang wrote (reply to Michael):
    
    > [KS X 2901 clause text quoted in Korean and English]
    
    Confirmed -- both match standard.go.kr.
    
    > I am planning to split into two patches:
    >   0001 -- Baseline capture [...]
    >   0002 -- Actual fix [...]
    
    The split is a nice shape for review -- characterization-first
    makes the behavior delta easy to see, and "exactly one line moves"
    is a useful audit signal.  Whether to keep the split or squash at
    commit time is a committer call.
    
    Regards,
    Henson Choi