
filer: route large Volumes writes through the multipart engine (off by default) #5756

Open
renaudhartert-db wants to merge 1 commit into multipart/03-engine from multipart/04-filer



@renaudhartert-db renaudhartert-db commented Jun 28, 2026

Contributor

Context

databricks fs cp and bundle library uploads to Unity Catalog Volumes go through a single PUT /api/2.0/fs/files, which caps a file at the single-request size limit and pushes it over one connection. This stack adds chunked upload (multipart on AWS/Azure, resumable on GCP) so large files upload reliably and in parallel. The whole feature is gated behind the DATABRICKS_EXPERIMENTAL_MULTIPART_UPLOAD environment variable and is off by default, so merging the stack changes no behavior until the flag is set.
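A minimal sketch of how an off-by-default environment gate like this might look. Only the variable name comes from the description above. The helper `multipartEnabled`, the injected `getenv` parameter, and the choice to parse the value as a boolean are assumptions made for illustration, not the PR's actual code.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envMultipartUpload is the flag named in the PR description.
const envMultipartUpload = "DATABRICKS_EXPERIMENTAL_MULTIPART_UPLOAD"

// multipartEnabled reports whether the experimental path is switched on.
// ASSUMPTION: the value is parsed with strconv.ParseBool, so unset, empty,
// and unparseable values all mean "off". The real gate may accept other values.
// getenv is injected so the gate can be checked without touching the process
// environment.
func multipartEnabled(getenv func(string) string) bool {
	on, err := strconv.ParseBool(getenv(envMultipartUpload))
	return err == nil && on
}

// fakeEnv returns a getenv function that answers only for the flag above.
func fakeEnv(value string) func(string) string {
	return func(key string) string {
		if key == envMultipartUpload {
			return value
		}
		return ""
	}
}

func main() {
	fmt.Println("from process env:", multipartEnabled(os.Getenv))
	fmt.Println("flag unset:", multipartEnabled(fakeEnv("")))
	fmt.Println("flag=true:", multipartEnabled(fakeEnv("true")))
}
```

Treating every unrecognized value as "off" keeps the default behavior unchanged unless the flag is set deliberately, which matches the stated intent of the gate.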

Stack

  1. libs/upload/cloudstorage: add cloud-storage transfer client #5753 cloud-storage data-plane client
  2. libs/upload/files: add Files API control-plane client #5754 Files API control-plane client
  3. libs/upload: add the chunked large-file upload engine #5755 chunked upload engine
  4. filer: route large Volumes writes through the multipart engine (off by default) #5756 route large Volumes writes through the engine (this PR)
  5. fs cp: share one multipart transfer budget across a recursive copy #5757 fs cp shared transfer budget
  6. fs cp: show an upload progress bar for a single large-file copy #5758 fs cp progress bar

This PR

Routes large Unity Catalog Volumes writes in libs/filer.FilesClient.Write through the upload engine, behind DATABRICKS_EXPERIMENTAL_MULTIPART_UPLOAD (off by default). When the flag is set, a seekable write to a Volume goes through the engine, which sends small files in a single PUT and splits large ones into parts. Non-seekable streams keep the existing single-shot PUT, and nothing changes when the flag is off. The filer builds one shared concurrency limiter and one transfer client (via new NewFilesClient options), so concurrent writes draw from a single bounded budget and connection pool. A 409 from the engine is mapped to fs.ErrExist, which preserves skip-if-exists. Note the blast radius: the bundle library-upload path uses this same filer, so with the flag on, its large libraries also go multipart.
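The routing rule and the conflict mapping described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `chooseRoute`, `statusError`, `mapConflict`, `isExist`, `demoReaders`, and the example Volume path are hypothetical names, and the seekability check is assumed to be a type assertion to `io.Seeker`.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"io/fs"
	"net/http"
)

// route names the upload path a write would take.
type route string

const (
	routeSinglePut route = "single-put" // existing single-shot PUT
	routeEngine    route = "engine"     // upload engine (single PUT or parts)
)

// chooseRoute sends a write to the engine only when the flag is on and the
// reader is seekable. ASSUMPTION: seekability is detected via io.Seeker.
func chooseRoute(flagOn bool, r io.Reader) route {
	if !flagOn {
		return routeSinglePut
	}
	if _, ok := r.(io.Seeker); ok {
		return routeEngine
	}
	return routeSinglePut
}

// statusError is a hypothetical stand-in for an engine error that carries an
// HTTP status code.
type statusError struct{ code int }

func (e *statusError) Error() string { return fmt.Sprintf("upload failed: HTTP %d", e.code) }

// mapConflict turns a 409 into an error that matches fs.ErrExist and passes
// every other error through unchanged.
func mapConflict(path string, err error) error {
	var se *statusError
	if errors.As(err, &se) && se.code == http.StatusConflict {
		return &fs.PathError{Op: "write", Path: path, Err: fs.ErrExist}
	}
	return err
}

// isExist is the check a skip-if-exists caller would make.
func isExist(err error) bool { return errors.Is(err, fs.ErrExist) }

// demoReaders returns one seekable reader and one plain stream.
func demoReaders() (seekable, stream io.Reader) {
	seekable = bytes.NewReader([]byte("payload"))
	// io.MultiReader hides Seek, so the result behaves like a pipe or stdin.
	stream = io.MultiReader(bytes.NewReader([]byte("payload")))
	return seekable, stream
}

func main() {
	seekable, stream := demoReaders()
	fmt.Println("flag on, seekable:", chooseRoute(true, seekable))
	fmt.Println("flag on, stream:", chooseRoute(true, stream))
	fmt.Println("flag off, seekable:", chooseRoute(false, seekable))

	err := mapConflict("/Volumes/main/default/vol/lib.whl", &statusError{code: http.StatusConflict})
	fmt.Println("409 matches fs.ErrExist:", isExist(err))
}
```

Wrapping the conflict in `fs.PathError` keeps the path in the message while still letting callers match it with `errors.Is(err, fs.ErrExist)`.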

Testing

Unit tests cover the seekability check, the mapping from already-exists to fs.ErrExist, the env-var gate, and the upload-concurrency option.
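To illustrate the property an upload-concurrency test might check, here is a minimal sketch of a shared limiter built on a buffered channel, with a helper that measures peak parallelism. The names `limiter`, `newLimiter`, and `peakConcurrency` are hypothetical, and the engine's real limiter may be implemented differently.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
	"sync/atomic"
)

// limiter is a counting semaphore backed by a buffered channel. Sharing one
// value across writers gives them a single bounded budget.
type limiter chan struct{}

func newLimiter(n int) limiter { return make(limiter, n) }

func (l limiter) acquire() { l <- struct{}{} } // blocks once n slots are held
func (l limiter) release() { <-l }

// peakConcurrency starts `tasks` goroutines that all draw from l and returns
// the highest number that held a slot at the same time.
func peakConcurrency(l limiter, tasks int) int64 {
	var inFlight, peak atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < tasks; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.acquire()
			defer l.release()

			n := inFlight.Add(1)
			// Record a new maximum if this goroutine observed one.
			for {
				p := peak.Load()
				if n <= p || peak.CompareAndSwap(p, n) {
					break
				}
			}
			runtime.Gosched() // stand-in for uploading one part
			inFlight.Add(-1)  // decrement before the deferred release runs
		}()
	}
	wg.Wait()
	return peak.Load()
}

func main() {
	const budget = 4
	peak := peakConcurrency(newLimiter(budget), 64)
	fmt.Println("peak within budget:", peak >= 1 && peak <= budget)
}
```

Because the counter is incremented only after a slot is acquired and decremented before it is released, the observed peak can never exceed the limiter's capacity, regardless of how many writers share it.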

This pull request and its description were written by Isaac.

@github-actions

Contributor

Waiting for approval

Based on git history, these people are best suited to review:

  • @pietern -- recent work in libs/filer/

Eligible reviewers: @Divyansh-db, @chrisst, @hectorcast-db, @mihaimitrea-db, @parthban-db, @rauchy, @simonfaltum, @tanmay-db, @tejaskochar-db

Suggestions based on git history. See OWNERS for ownership rules.


eng-dev-ecosystem-bot commented Jun 28, 2026

Collaborator

Integration test report

Commit: 42be751

Run: 28328652595

| Env | 🟨 KNOWN | 💚 RECOVERED | 🙈 SKIP | ✅ pass | 🙈 skip | Time |
|---|---|---|---|---|---|---|
| 🟨 aws linux | 7 | 1 | 13 | 235 | 1035 | 7:15 |
| 🟨 aws windows | 7 | 1 | 13 | 237 | 1033 | 10:49 |
| 💚 aws-ucws linux | | 8 | 13 | 322 | 952 | 6:35 |
| 💚 aws-ucws windows | | 8 | 13 | 324 | 950 | 8:39 |
| 💚 azure linux | | 2 | 15 | 235 | 1034 | 6:02 |
| 💚 azure windows | | 2 | 15 | 237 | 1032 | 7:02 |
| 💚 azure-ucws linux | | 2 | 15 | 324 | 949 | 6:52 |
| 💚 azure-ucws windows | | 2 | 15 | 326 | 947 | 7:30 |
| 💚 gcp linux | | 2 | 15 | 234 | 1036 | 6:18 |
| 💚 gcp windows | | 2 | 15 | 236 | 1034 | 8:03 |

21 interesting tests: 13 SKIP, 7 KNOWN, 1 RECOVERED

| Test Name | aws linux | aws windows | aws-ucws linux | aws-ucws windows | azure linux | azure windows | azure-ucws linux | azure-ucws windows | gcp linux | gcp windows |
|---|---|---|---|---|---|---|---|---|---|---|
| 🟨 TestAccept | 🟨 K | 🟨 K | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R |
| 🙈 TestAccept/bundle/invariant/no_drift | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/permissions | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions | 🟨 K | 🟨 K | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/with_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions | 🟨 K | 🟨 K | 💚 R | 💚 R | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=direct | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🟨 TestAccept/bundle/resources/permissions/jobs/destroy_without_mgmtperms/without_permissions/DATABRICKS_BUNDLE_ENGINE=terraform | 🟨 K | 🟨 K | 💚 R | 💚 R | | | | | | |
| 🙈 TestAccept/bundle/resources/postgres_branches/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/recreate | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/replace_existing | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/update_protected | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_branches/without_branch_id | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_endpoints/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/postgres_projects/update_display_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/synced_database_tables/basic | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_endpoints/drift/recreated_same_name | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/bundle/resources/vector_search_indexes/recreate/embedding_dimension | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 🙈 TestAccept/ssh/connection | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S | 🙈 S |
| 💚 TestFetchRepositoryInfoAPI_FromRepo | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R | 💚 R |

Top 8 slowest tests (at least 2 minutes):

| duration | env | testname |
|---|---|---|
| 7:08 | aws-ucws windows | TestAccept |
| 6:56 | gcp windows | TestAccept |
| 6:02 | azure-ucws windows | TestAccept |
| 5:53 | azure windows | TestAccept |
| 2:48 | azure-ucws linux | TestAccept |
| 2:48 | azure linux | TestAccept |
| 2:46 | gcp linux | TestAccept |
| 2:45 | aws-ucws linux | TestAccept |

…y default)

Wires the large-file upload engine into FilesClient.Write behind the
DATABRICKS_EXPERIMENTAL_MULTIPART_UPLOAD environment variable, disabled by
default so behavior is unchanged unless the flag is set. When enabled, seekable
writes to a UC Volume go through the engine (single-shot for small files,
multipart for large ones); non-seekable streams keep the existing single-shot
PUT. NewFilesClient gains options for a shared concurrency limiter and transfer
client, and an already-exists conflict from the engine maps to fs.ErrExist so
skip-if-exists keeps working.

Co-authored-by: Isaac
