gpu offload host code generation #142097
Conversation
@oli-obk Feature-wise, I am almost done. I'll add a few more lines to describe the layout of Rust types to the offload library, but in this PR I only intend to support one or two types (maybe arrays, raw pointers, or slices). I might even hardcode the length in the very first approach. In a follow-up PR I'll do some proper type parsing on a higher level, similar to what I did in the past with Rust TypeTrees. This work is much simpler and more reliable though, since offload doesn't care what type something has, just how many bytes large it is, and hence how many bytes need to be moved to/from the GPU. I was able to just move a few of the builder methods I needed to the generic builder.
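To illustrate the point made above — that offload only needs the byte size of each argument, not its full type — here is a minimal sketch (not code from this PR; the helper names are made up) of how those sizes could be computed for the kinds of types mentioned:

```rust
// Illustration only (not code from this PR): the offload runtime does not
// need full Rust type information, only how many bytes each argument
// occupies, i.e. how much data must be moved to/from the GPU.
use std::mem::size_of;

// Bytes to transfer for a fixed-size array: known at compile time.
fn bytes_of_array<T, const N: usize>(_: &[T; N]) -> usize {
    N * size_of::<T>()
}

// Bytes to transfer for a slice: the length is only known at runtime.
fn bytes_of_slice<T>(s: &[T]) -> usize {
    s.len() * size_of::<T>()
}

fn main() {
    let a = [0.0f32; 256];
    let v = vec![0.0f64; 100];
    println!("{}", bytes_of_array(&a)); // 256 * 4 = 1024 bytes
    println!("{}", bytes_of_slice(&v)); // 100 * 8 = 800 bytes
}
```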
Not fully ready yet, I apparently missed yet another global needed to initialize the offload runtime. But at least it compiles successfully to a binary if I emit the IR from Rust and then use clang for the rest. I'll add the global today; then I should be done and will clean it up.
☔ The latest upstream changes (presumably #142644) made this pull request unmergeable. Please resolve the merge conflicts.
100f9f3 to 0fb93f0
Yay, turns out the only issue in my test binary was a bug in LLVM, which was already fixed upstream in llvm/llvm-project#143638.
Some changes occurred in compiler/rustc_codegen_ssa
I did the first round of reviews myself; I'll address them tomorrow.
I'll also clean up the code in gpu_builder more, it has a lot of duplication and IR comments from when I was trying to figure out what to generate.
@@ -117,6 +118,70 @@ impl<'a, 'll, CX: Borrow<SCx<'ll>>> GenericBuilder<'a, 'll, CX> {
        }
        bx
    }

    pub(crate) fn my_alloca2(&mut self, ty: &'ll Type, align: Align, name: &str) -> &'ll Value {
I'll find a better name for it.
also document why/how it is different from alloca
ok, I think I'm mostly done. Do you have any suggestions? I don't want to add any actual run tests, as these would require a working clang based on the same commit.
@@ -667,6 +668,12 @@ pub(crate) fn run_pass_manager(
        write::llvm_optimize(cgcx, dcx, module, None, config, opt_level, opt_stage, stage)?;
    }

    if cfg!(llvm_enzyme) && enable_gpu && !thin {
There is no dependency of offload on Enzyme, but since I think I'm supposed to gate my features, for now I'll just re-use the ones from Enzyme.
☔ The latest upstream changes (presumably #143026) made this pull request unmergeable. Please resolve the merge conflicts.
I... don't know if I can review this properly. I can review it from the "does this fit into how I want the llvm backend to look" side, but what it actually does just looks random to me.
    let llcx = llvm::LLVMRustContextCreate(false);
    let module_name = CString::new("offload.wrapper.module").unwrap();
    let llmod = llvm::LLVMModuleCreateWithNameInContext(module_name.as_ptr(), llcx);
    let cx = SimpleCx::new(llmod, llcx, cgcx.pointer_size);
    let tptr = cx.type_ptr();
    let ti64 = cx.type_i64();
    let ti32 = cx.type_i32();
    let ti16 = cx.type_i16();
    let dl_cstr = llvm::LLVMGetDataLayoutStr(old_cx.llmod);
    llvm::LLVMSetDataLayout(llmod, dl_cstr);
    let target_cstr = llvm::LLVMGetTarget(old_cx.llmod);
    llvm::LLVMSetTarget(llmod, target_cstr);
this shares a bit of code with create_module, can we do better? Or at least make all the individual functions have safe wrappers
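As a side note on the safe-wrapper suggestion: the usual pattern is to confine the raw-pointer handling inside one function that upholds the FFI invariants. A minimal sketch of that pattern follows, using a stand-in unsafe function rather than the real `llvm::LLVMSetDataLayout` binding (all names here are hypothetical):

```rust
// Sketch of the safe-wrapper pattern: the unsafe FFI-style call is confined
// to one wrapper that guarantees its invariants (a valid, NUL-terminated
// C string), so call sites stay free of `unsafe` blocks.
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

// Stand-in for an unsafe binding such as LLVMSetDataLayout; here it just
// copies the C string into a Rust String so the example is runnable.
unsafe fn ffi_set_data_layout(dest: &mut String, dl: *const c_char) {
    *dest = unsafe { CStr::from_ptr(dl) }.to_string_lossy().into_owned();
}

// The safe wrapper owns the conversion and pointer lifetime.
fn set_data_layout(dest: &mut String, dl: &str) {
    let c = CString::new(dl).expect("data layout must not contain NUL");
    // `c` stays alive for the whole call, so the pointer is valid.
    unsafe { ffi_set_data_layout(dest, c.as_ptr()) }
}

fn main() {
    let mut module_dl = String::new();
    set_data_layout(&mut module_dl, "e-m:e-i64:64");
    println!("{module_dl}"); // prints the data layout string we set
}
```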
    offload_entry_ty
}

fn gen_globals<'ll>(
uh. please do one function per global, they are mostly unrelated after all
    let foo = crate::declare::declare_simple_fn(
        &cx,
        &mapper_begin,
        llvm::CallConv::CCallConv,
        llvm::UnnamedAddr::No,
        llvm::Visibility::Default,
        mapper_fn_ty,
    );
    let bar = crate::declare::declare_simple_fn(
        &cx,
        &mapper_update,
        llvm::CallConv::CCallConv,
        llvm::UnnamedAddr::No,
        llvm::Visibility::Default,
        mapper_fn_ty,
    );
    let baz = crate::declare::declare_simple_fn(
        &cx,
        &mapper_end,
        llvm::CallConv::CCallConv,
        llvm::UnnamedAddr::No,
        llvm::Visibility::Default,
        mapper_fn_ty,
    );
name these the same as at the use site
why is clang necessary for this?
This time I started with dev guide docs! https://rustc-dev-guide.rust-lang.org/offload/installation.html#usage
Thanks! And no worries, I'm discussing the offloading design with @jdoerfert and @kevinsala. The memory transfer is pretty straightforward and not that interesting. The only question was how many layers of abstraction we wanted, but we made a decision which should be fine; we can always re-evaluate it later. For the kernel launches PR I'll ask them to also review the code, but they aren't Rust devs, so your reviews on the rustc side are definitely appreciated!
r? ghost
This will generate most of the host side code to use llvm's offload feature.
The first PR will only handle automatic mem-transfers to and from the device.
So if a user calls a kernel, we will copy inputs back and forth, but we won't do the actual kernel launch.
Before merging, we will use LLVM's Info infrastructure to verify that the memcopies match what OpenMP offload generates in C++.
LIBOMPTARGET_INFO=-1 ./my_rust_binary
should print that a memcpy to and later from the device is happening. A follow-up PR will generate the actual device-side kernel which will then do computations on the GPU.
A third PR will implement manual host2device and device2host functionality, but the goal is to minimize cases where a user has to override our default handling due to performance issues.
I'm trying to get a full MVP out first, so this just recognizes GPU functions based on magic names. The final frontend will obviously move this over to use proper macros, like I'm already doing for the autodiff work.
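The "magic names" recognition described above could look roughly like the following sketch (hypothetical: the actual marker string used by the PR is not shown here, so the `kernel_` prefix is made up):

```rust
// Hypothetical sketch of name-based kernel recognition for the MVP; a real
// frontend would use a proper attribute macro instead of a naming convention.
fn is_offload_kernel(fn_name: &str) -> bool {
    // Made-up convention: any function whose name starts with "kernel_"
    // is treated as a GPU kernel and gets host-side offload glue generated.
    fn_name.starts_with("kernel_")
}

fn main() {
    println!("{}", is_offload_kernel("kernel_1")); // true
    println!("{}", is_offload_kernel("main"));     // false
}
```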
This work will also be compatible with std::autodiff, so one can differentiate GPU kernels.
Tracking: