Skip to content

BUG [TRINO]: @generate_surrogate_key macro does not work with SHA256 on trino #5871

Description

@krutoileshii

Summary

The built-in @generate_surrogate_key macro only produces valid output for MD5. For
hash_function := 'SHA256' (and 'SHA512') it emits a bare SHA256(CONCAT(...)), which
is invalid on Trino/Presto: sha256 accepts only varbinary (not varchar), and there is
no TO_HEX, so the result would be raw bytes rather than the documented string key.

Versions

  • sqlmesh 0.235.4
  • sqlglot 30.8.0
  • engine: Trino (also affects Presto)

Reproduction

from sqlglot import parse_one
from sqlmesh.core.macros import MacroEvaluator

ev = MacroEvaluator(dialect="trino")
sql = "SELECT @GENERATE_SURROGATE_KEY(a, b, hash_function := 'SHA256') AS k"
print(ev.transform(parse_one(sql, dialect="trino")).sql("trino"))

Actual output

SELECT SHA256(CONCAT(
  CAST(COALESCE(CAST(a AS VARCHAR), '_sqlmesh_surrogate_key_null_') AS VARCHAR),
  CAST('|' AS VARCHAR),
  CAST(COALESCE(CAST(b AS VARCHAR), '_sqlmesh_surrogate_key_null_') AS VARCHAR)
)) AS k

Running this on Trino fails at execution time:
TrinoUserError(type=USER_ERROR, name=FUNCTION_NOT_FOUND,
message="Unexpected parameters (varchar) for function sha256. Expected: sha256(varbinary)")

Expected

A surrogate key that is valid (and returns a string) on the target engine, e.g. on Trino:

SELECT LOWER(TO_HEX(SHA256(TO_UTF8(CONCAT(...))))) AS k

Root cause

In sqlmesh/core/macros.py, generate_surrogate_key builds a generic hash node and then
only fixes up MD5:
func = exp.func(hash_function.name, exp.func("CONCAT", *string_fields), dialect=evaluator.dialect)
if isinstance(func, exp.MD5Digest):
func = exp.MD5(this=func.this) # gives Trino LOWER(TO_HEX(MD5(TO_UTF8(...))))
return func
exp.MD5 is a semantic node that the Trino/Presto generator wraps in TO_UTF8/TO_HEX.
No analogous handling exists for SHA-256/SHA-512, so those hashes fall through as a literal
SHA256() and break on engines where the digest function requires varbinary
and/or returns binary.

Suggested direction

Have the macro construct an engine-valid string digest for non-MD5 hashes too (i.e. apply
the same TO_UTF8 input / TO_HEX output treatment for engines that need it), rather than
relying on sqlglot to opinionatedly rewrite a generic SHA256(...) call. At minimum,
SHA256/SHA512 should be documented + tested per supported engine, since hash_function
is an advertised option.

For now I have this custom macro to work around it:

import typing as t

from sqlmesh import macro
from sqlglot import exp


@macro()
def surrogate_key(
    evaluator,
    *fields: exp.Expr,
    hash_function: exp.Literal = exp.Literal.string("SHA256"),
) -> exp.Expr:
    """Generates a hashed surrogate key (lowercase hex string) for the given fields.

    Field handling matches the built-in @generate_surrogate_key (NULLs coalesced to
    '_sqlmesh_surrogate_key_null_', joined by '|'), but the hash step is correct for
    Trino's binary digest functions (MD5, SHA1, SHA256, SHA512): the concatenated input
    is wrapped in TO_UTF8 (they require varbinary) and the binary digest is converted
    back to a lowercase hex string via TO_HEX/LOWER.

    The digest is built as an Anonymous node so sqlglot emits the function name verbatim;
    exp.func("MD5", ...) would hit sqlglot's MD5 semantic node and double-wrap it.
    """
    string_fields: t.List[exp.Expr] = []
    for i, field in enumerate(fields):
        if i > 0:
            string_fields.append(exp.Literal.string("|"))
        string_fields.append(
            exp.func(
                "COALESCE",
                exp.cast(field, exp.DataType.build("text")),
                exp.Literal.string("_sqlmesh_surrogate_key_null_"),
            )
        )

    digest = exp.Anonymous(
        this=hash_function.name.lower(),
        expressions=[exp.func("TO_UTF8", exp.func("CONCAT", *string_fields))],
    )
    return exp.func("LOWER", exp.func("TO_HEX", digest))

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions