AllCloud

GPU Engineer

Posted 8 Days Ago
Remote
Hiring Remotely in United States
Senior level
The GPU Engineer will design and optimize GPU-based infrastructure for LLM training, focusing on performance enhancements and resource management in a cloud environment.
The summary above was generated by AI
Description

GPU Engineer

Location: US / Canada (Eastern Time) - Home-based

Job Type: Full-time, Permanent 

About AllCloud

AllCloud is a global professional services company providing organizations with cloud enablement and transformation tools. As an AWS Premier Consulting Partner and audited MSP, a Salesforce Platinum Partner, and a Snowflake Premier Partner, AllCloud helps clients connect their front and back offices by building a new operating model to harness the benefits of cloud technology and data and analytics.

Job Summary

We are seeking an experienced GPU Engineer to join our innovative AI team at AllCloud. This role will be responsible for designing, implementing, and optimizing GPU-based infrastructure for large-scale LLM training and inference. The ideal candidate will have deep expertise in GPU architecture, parallel computing, and performance optimization for machine learning workloads. You'll work closely with our LLM Architects and ML Engineers to build and maintain the high-performance computing environment required for training our custom transformer-based language models.

Responsibilities

  • Design and implement scalable GPU clusters on AWS infrastructure for distributed LLM training
  • Optimize GPU memory usage, computational throughput, and inter-node communication for transformer model training
  • Configure and tune GPU acceleration libraries (CUDA, cuDNN, NCCL) for maximum performance
  • Implement mixed precision training and other optimization techniques to improve training efficiency
  • Architect and deploy GPU-based inference solutions that balance latency, throughput, and cost
  • Create benchmarking tools to measure and improve model training and inference performance
  • Establish monitoring and management systems for GPU resources to maximize utilization and reliability
  • Collaborate with LLM Architects to implement parallelization strategies (model, data, pipeline parallelism)
  • Troubleshoot hardware and software issues affecting GPU performance
  • Keep current with advancements in GPU technology and AI accelerator hardware


Requirements

Summary of Key Requirements

  • 5+ years of experience optimizing GPU infrastructure for machine learning workloads
  • Advanced knowledge of NVIDIA GPU architecture and CUDA programming
  • Strong understanding of high-performance computing (HPC), AI network architecture, and physical layer management
  • Experience with AWS GPU instances (e.g., P4d, P5, G5) and AWS Batch for ML workloads
  • Strong background in distributed computing and parallel processing techniques
  • Familiarity with transformer architecture and deep learning frameworks like PyTorch or TensorFlow
  • Expertise in performance profiling and bottleneck identification in GPU workloads
  • Experience with containerization (Docker) and orchestration (Kubernetes)
  • Understanding of memory optimization techniques for large language models
  • Bachelor's degree in Computer Science, Electrical Engineering, or related field (Master's preferred)

Certifications

  • AWS Certified Solutions Architect - Professional (Strongly Preferred)
  • NVIDIA-Certified Professional: Accelerated Data Science (Preferred)
  • NVIDIA-Certified Professional: AI Infrastructure or AI Networking (NCP-AIN) (Preferred)

Why work for us? 

Our team inspires progress in each other and in our customers through our relentless pursuit of excellence; you will work with leaders who promote learning and personal development.


AllCloud is an Equal Opportunity Employer and considers applicants for employment without regard to race, color, religion, sex, orientation, national origin, age, disability, genetics or any other basis forbidden under federal, provincial, or local law.


Top Skills

AWS
CUDA
cuDNN
Docker
GPU Clusters
Kubernetes
NCCL
PyTorch
TensorFlow
HQ

AllCloud Denver, Colorado, USA Office

1624 Market St, Suite 226, Denver, Colorado, United States, 80202
