Summary
The built-in @generate_surrogate_key macro only produces valid output for MD5. For
hash_function := 'SHA256' (and 'SHA512') it emits a bare SHA256(CONCAT(...)), which
is invalid on Trino/Presto: sha256 accepts only varbinary (not varchar), and there is
no TO_HEX, so the result would be raw bytes rather than the documented string key.
Versions
- sqlmesh 0.235.4
- sqlglot 30.8.0
- engine: Trino (also affects Presto)
Reproduction
from sqlglot import parse_one
from sqlmesh.core.macros import MacroEvaluator
ev = MacroEvaluator(dialect="trino")
sql = "SELECT @GENERATE_SURROGATE_KEY(a, b, hash_function := 'SHA256') AS k"
print(ev.transform(parse_one(sql, dialect="trino")).sql("trino"))
Actual output
SELECT SHA256(CONCAT(
CAST(COALESCE(CAST(a AS VARCHAR), '_sqlmesh_surrogate_key_null_') AS VARCHAR),
CAST('|' AS VARCHAR),
CAST(COALESCE(CAST(b AS VARCHAR), '_sqlmesh_surrogate_key_null_') AS VARCHAR)
)) AS k
Running this on Trino fails at execution time:
TrinoUserError(type=USER_ERROR, name=FUNCTION_NOT_FOUND,
message="Unexpected parameters (varchar) for function sha256. Expected: sha256(varbinary)")
Expected
A surrogate key that is valid (and returns a string) on the target engine, e.g. on Trino:
SELECT LOWER(TO_HEX(SHA256(TO_UTF8(CONCAT(...))))) AS k
Root cause
In sqlmesh/core/macros.py, generate_surrogate_key builds a generic hash node and then
only fixes up MD5:
func = exp.func(hash_function.name, exp.func("CONCAT", *string_fields), dialect=evaluator.dialect)
if isinstance(func, exp.MD5Digest):
func = exp.MD5(this=func.this) # gives Trino LOWER(TO_HEX(MD5(TO_UTF8(...))))
return func
exp.MD5 is a semantic node that the Trino/Presto generator wraps in TO_UTF8/TO_HEX.
No analogous handling exists for SHA-256/SHA-512, so those hashes fall through as a literal
SHA256() and break on engines where the digest function requires varbinary
and/or returns binary.
Suggested direction
Have the macro construct an engine-valid string digest for non-MD5 hashes too (i.e. apply
the same TO_UTF8 input / TO_HEX output treatment for engines that need it), rather than
relying on sqlglot to opinionatedly rewrite a generic SHA256(...) call. At minimum,
SHA256/SHA512 should be documented + tested per supported engine, since hash_function
is an advertised option.
For now I have this custom macro to work around it:
import typing as t
from sqlmesh import macro
from sqlglot import exp
@macro()
def surrogate_key(
evaluator,
*fields: exp.Expr,
hash_function: exp.Literal = exp.Literal.string("SHA256"),
) -> exp.Expr:
"""Generates a hashed surrogate key (lowercase hex string) for the given fields.
Field handling matches the built-in @generate_surrogate_key (NULLs coalesced to
'_sqlmesh_surrogate_key_null_', joined by '|'), but the hash step is correct for
Trino's binary digest functions (MD5, SHA1, SHA256, SHA512): the concatenated input
is wrapped in TO_UTF8 (they require varbinary) and the binary digest is converted
back to a lowercase hex string via TO_HEX/LOWER.
The digest is built as an Anonymous node so sqlglot emits the function name verbatim;
exp.func("MD5", ...) would hit sqlglot's MD5 semantic node and double-wrap it.
"""
string_fields: t.List[exp.Expr] = []
for i, field in enumerate(fields):
if i > 0:
string_fields.append(exp.Literal.string("|"))
string_fields.append(
exp.func(
"COALESCE",
exp.cast(field, exp.DataType.build("text")),
exp.Literal.string("_sqlmesh_surrogate_key_null_"),
)
)
digest = exp.Anonymous(
this=hash_function.name.lower(),
expressions=[exp.func("TO_UTF8", exp.func("CONCAT", *string_fields))],
)
return exp.func("LOWER", exp.func("TO_HEX", digest))
Summary
The built-in
@generate_surrogate_keymacro only produces valid output for MD5. Forhash_function := 'SHA256'(and'SHA512') it emits a bareSHA256(CONCAT(...)), whichis invalid on Trino/Presto:
sha256accepts onlyvarbinary(notvarchar), and there isno
TO_HEX, so the result would be raw bytes rather than the documented string key.Versions
Reproduction
Actual output
Running this on Trino fails at execution time:
TrinoUserError(type=USER_ERROR, name=FUNCTION_NOT_FOUND,
message="Unexpected parameters (varchar) for function sha256. Expected: sha256(varbinary)")
Expected
A surrogate key that is valid (and returns a string) on the target engine, e.g. on Trino:
Root cause
In sqlmesh/core/macros.py, generate_surrogate_key builds a generic hash node and then
only fixes up MD5:
func = exp.func(hash_function.name, exp.func("CONCAT", *string_fields), dialect=evaluator.dialect)
if isinstance(func, exp.MD5Digest):
func = exp.MD5(this=func.this) # gives Trino LOWER(TO_HEX(MD5(TO_UTF8(...))))
return func
exp.MD5 is a semantic node that the Trino/Presto generator wraps in TO_UTF8/TO_HEX.
No analogous handling exists for SHA-256/SHA-512, so those hashes fall through as a literal
SHA256() and break on engines where the digest function requires varbinary
and/or returns binary.
Suggested direction
Have the macro construct an engine-valid string digest for non-MD5 hashes too (i.e. apply
the same TO_UTF8 input / TO_HEX output treatment for engines that need it), rather than
relying on sqlglot to opinionatedly rewrite a generic SHA256(...) call. At minimum,
SHA256/SHA512 should be documented + tested per supported engine, since hash_function
is an advertised option.
For now I have this custom macro to work around it: