Thread

  1. Re: Remove invalid SS2/SS3 handling from EUC-KR routines

    Henson Choi <assam258@gmail.com> — 2026-05-12T11:39:04Z

    Hi SungJun,
    
    Thanks for the patch.  I applied v1 on top of master; it builds
    cleanly and the regression tests pass here.  I agree with the
    direction; a few comments inline.
    
    > Per KS X 2901 (formerly KS C 5861-1992), EUC-KR designates only G0
    > (ASCII) and G1 (KS X 1001).  G2 and G3 are not designated; the
    > single-shift codes SS2 (0x8E) and SS3 (0x8F) therefore cannot appear
    > as lead bytes, and no 3-byte sequence is ever valid in EUC-KR.
    >
    
    Right.  I checked the existing pg_euckr_verifychar() in wchar.c and
    it indeed has no SS2/SS3 branch -- it accepts only ASCII and 0xA1-0xFE
    lead bytes via IS_EUC_RANGE_VALID().  So the verifier and the new
    mblen()/mb2wchar_with_len() now tell the same story for valid input,
    which is the goal.
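
    To make that agreement concrete, here is a minimal sketch of the
    1-2 byte shape as I understand it.  This is not the patch's code or
    the code in wchar.c: the sketch_* names and macros are hypothetical,
    and the byte ranges are just the ones mentioned above (ASCII, and
    0xA1-0xFE for both bytes of a two-byte character).

```c
#include <stddef.h>

/* Hypothetical helpers for illustration only; not taken from wchar.c. */
#define SKETCH_HIGHBIT_SET(c)     (((unsigned char) (c)) & 0x80)
#define SKETCH_EUC_RANGE_VALID(c) (((unsigned char) (c)) >= 0xA1 && \
                                   ((unsigned char) (c)) <= 0xFE)

/*
 * Length of the character starting at s, judged by the lead byte alone:
 * any high-bit lead byte means 2 bytes, everything else is 1 byte.
 * There is no SS2/SS3 branch, so the result is never 3.
 */
static int
sketch_euckr_mblen(const unsigned char *s)
{
    return SKETCH_HIGHBIT_SET(*s) ? 2 : 1;
}

/*
 * Returns the byte length of the character at s if it is valid,
 * or -1 if it is invalid or truncated (fewer than 2 bytes left).
 */
static int
sketch_euckr_verifychar(const unsigned char *s, int len)
{
    if (len < 1)
        return -1;

    if (!SKETCH_HIGHBIT_SET(s[0]))
        return 1;               /* ASCII */

    if (len < 2)
        return -1;              /* truncated two-byte character */

    /* 0x8E (SS2) and 0x8F (SS3) fall outside 0xA1-0xFE, so they fail here */
    if (!SKETCH_EUC_RANGE_VALID(s[0]) || !SKETCH_EUC_RANGE_VALID(s[1]))
        return -1;

    return 2;
}
```

    For valid input both functions return the same length; for a lead
    byte of 0x8E or 0x8F the length function still answers 2 while the
    verifier rejects the sequence.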
    
    I also did a wider audit for EUC-KR-specific SS2/SS3 handling outside
    wchar.c and found none: the UTF-8 <-> EUC-KR conversion proc is
    clean, pg_eucjp_increment in mbutils.c is EUC-JP only, and EUC-KR
    falls back to pg_generic_charinc which delegates to
    pg_euckr_verifychar (already SS2/SS3-free).  So the three functions
    this patch rewrites are the only entry points; no dangling SS2/SS3
    path remains for EUC-KR after the patch.
    
    > - Set maxmblen from 3 to 2 in pg_wchar_table[PG_EUC_KR].
    >
    
    This is the only user-visible change: pg_encoding_max_length() (see
    mbutils.c) now returns 2 instead of 3 for EUC_KR.  I don't think any
    real client code relies on 3, but the release notes should mention it.
    
    One small observation: after this patch, EUC-KR's mb routines become
    structurally identical to UHC's (1-2 byte Korean, IS_HIGHBIT_SET-only
    branch, maxmblen=2), which is a nice consistency win and arguably the
    right shape for "Korean 1-2 byte EUC".  Could be worth a one-liner in
    the commit message.
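
    For the archive, a rough sketch of what I mean by that shared shape,
    using a bounded multibyte-to-wide conversion as the example.  The
    function name, the plain unsigned int output type, and the
    (lead << 8) | trail packing are assumptions made for illustration;
    this is not copied from the patch or from the UHC routines.

```c
#include <stddef.h>

/*
 * Hypothetical sketch of a 1-2 byte conversion loop with a single
 * high-bit branch.  Converts at most len bytes from 'from' into wide
 * values in 'to', stops early at a NUL byte, and returns the number of
 * characters written.  'to' must have room for len + 1 entries.
 */
static int
sketch_kr_mb2wchar_with_len(const unsigned char *from,
                            unsigned int *to, int len)
{
    int cnt = 0;

    while (len > 0 && *from)
    {
        if ((*from & 0x80) && len >= 2)
        {
            /* two-byte character: pack lead and trail bytes */
            *to = ((unsigned int) from[0] << 8) | from[1];
            from += 2;
            len -= 2;
        }
        else
        {
            /* ASCII, or a lone high-bit byte at the end of the buffer */
            *to = *from++;
            len--;
        }
        to++;
        cnt++;
    }
    *to = 0;                    /* terminate the output */
    return cnt;
}
```

    The point is only that one IS_HIGHBIT_SET-style test selects between
    the 1-byte and 2-byte paths; no lead byte value needs a case of its
    own.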
    
    +1 from me.
    
    
    ---- Side note on EUC-CN ----
    
    GB 2312 under EUC-CN appears to be in the same standards situation
    -- the existing pg_euccn_mblen comment in wchar.c states "CS2 and
    CS3 are not defined for EUC_CN", and the af79c30dc3e commit message
    similarly says "EUC_CN supports only 1- and 2-byte sequences (CS0,
    CS1)" -- yet pg_euccn_* still carries SS2/SS3 branches and keeps
    maxmblen=3.  As I read it, that shape was a deliberate choice in
    commit af79c30dc3e ("Fix encoding length for EUC_CN", CVE-2026-2006)
    -- the minimal back-patchable fix -- and the commit message seems to
    leave the door open for master to "harmonize in a different
    direction", though I may be reading more into it than was intended.
    
    An analogous self-contained cleanup of EUC-CN looks like a natural
    follow-up.  Historically the EUC code in wchar.c was shaped by
    Japanese and Western contributors -- which is why the shared
    pg_euc_* helpers carry JIS X 0201/0212/0208 assumptions -- and
    EUC-CN inherited that shape by delegation.  With the Chinese
    contributor community now well established in the project, an
    EUC-CN cleanup feels like a natural fit for contributors closer to
    that ecosystem, who can also supply native test data, in the same
    way KS X 2901 grounds this patch on the Korean side.  Noting it
    here so the idea stays on the archive; no action requested in this
    thread.
    
    Regards,
    Henson Choi
    