GPUs for Hardware Professionals
Copyright © Schmied Enterprises LLC, 2025.
The big question for many companies right now is: should we invest in new GPUs?
The answer is a resounding yes. Recent conferences, like the one hosted by Supermicro & AMD, highlight the suitability of AMD processors for AI inference, even for models with 10 billion parameters. But GPUs offer something more, something crucial beyond just AI processing: graphics. Visual elements in documents, marketing materials, and user interfaces demand significantly more computing power than text. So, the need for GPUs isn't solely driven by AI; it's fueled by the ever-increasing demand for rich, visually engaging content.
Let's break down the GPU decision from a CIO's perspective. The output of AI systems is increasingly multimodal, encompassing text, audio, images, and video. Your infrastructure needs to handle the bandwidth required for the latter three, and that's where GPUs shine. Since GPUs are often the most expensive component in your bill of materials (BOM), maximizing their utilization is key. Consider optimizing peak workloads with cloud-based GPU rentals or specialized software solutions like Phison's AIDAPTIV, which uses SSD caching for PyTorch to offload some of the burden from expensive GPUs.
Ultimately, your GPU needs are determined by the sustained workload stream required to support your employees and AI processors. Anything beyond that can be supplemented with the solutions mentioned above.
Here are some guiding principles to consider:
Rule A: Output Exceeds Input. Ensure your corporate system has higher output bandwidth than input. This is crucial because all data written needs to be read at least once for filtering and indexing, regardless of whether it's used later.
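As a back-of-envelope illustration of Rule A, here is a minimal sizing sketch in Python; the bandwidth figures and the read multiplier are assumptions, not measured values.

```python
# Back-of-envelope check for Rule A: every byte ingested is read back at
# least once for filtering and indexing, so provisioned read bandwidth
# should exceed write (ingest) bandwidth. All figures below are assumptions.
ingest_gbps = 4.0              # assumed sustained write bandwidth, GB/s
reads_per_byte = 1.5           # filtering + indexing + occasional re-reads
required_read_gbps = ingest_gbps * reads_per_byte

print(f"Provision at least {required_read_gbps:.1f} GB/s of read bandwidth "
      f"for {ingest_gbps:.1f} GB/s of ingest")
```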
Rule B: AI as a State Machine. Model your AI systems as a state machine with a "context," a memory block defining the problem and potential solutions. Each context maps to one or more potential next context blocks stored in GPU memory, system memory, or low-latency SSDs. Think of it as a two-dimensional array where the context points to the next relevant block.
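A minimal sketch of this state-machine view, assuming purely illustrative data structures; the Block class and the transitions mapping are not a real framework.

```python
# Illustrative "AI as a state machine": a context id maps to one or more
# candidate next blocks, which may live in GPU memory, host memory, or on SSD.
# Block and transitions are made-up structures, not a real framework.
from dataclasses import dataclass

@dataclass
class Block:
    data: bytes          # weights or content for this step
    tier: str            # "gpu", "host", or "ssd"

# Two-dimensional view: context id -> candidate next blocks
transitions: dict[str, list[Block]] = {
    "invoice-summary": [Block(b"...", "gpu")],
    "hero-image":      [Block(b"...", "gpu"), Block(b"...", "ssd")],
}

def next_blocks(context_id: str) -> list[Block]:
    return transitions.get(context_id, [])

print(len(next_blocks("hero-image")))   # 2 candidates: a creative context
```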
Rule C: Creativity and Context. If your context suggests multiple potential blocks, you're likely dealing with a creative or generative AI task. This might involve synonyms or stages requiring stable diffusion for image generation. However, in most cases, you'll want to select a single, specific block to merge with the current context. Multiple choices could indicate inadequate training or an overly broad context. Refine your request with a more specific prompt.
Rule D: Training Success. A system that consistently provides a single, specific block for each context state indicates successful training. This can serve as a test to determine if further training epochs are necessary.
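Rules C and D can be turned into a simple check. The sketch below counts how many contexts still map to more than one candidate block; the example data and the 5% threshold are assumptions for illustration.

```python
# Rules C and D as a check: count contexts that still map to several
# candidate blocks. The example mapping and the 5% threshold are assumptions.
def ambiguity_rate(transitions: dict[str, list]) -> float:
    ambiguous = sum(1 for blocks in transitions.values() if len(blocks) > 1)
    return ambiguous / max(len(transitions), 1)

example = {"invoice-summary": ["b1"], "hero-image": ["b7", "b9"]}
if ambiguity_rate(example) > 0.05:
    print("Many contexts still have several candidates: train further or narrow the prompt.")
else:
    print("Contexts resolve to single blocks: training looks converged.")
```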
Rule E: Reproducible Creativity. When you need creative output, like generating test images to understand your model's behavior, use seeds to ensure reproducibility. This is essential for enterprise applications. Avoid relying on randomness to save on expenses and ensure consistent results.
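A minimal seeding sketch using standard PyTorch and NumPy calls; the generation step itself is omitted, and the note about passing a generator to a diffusion pipeline assumes the pipeline accepts one.

```python
# Pin every random source with a fixed seed so generated test images (or any
# sampled output) can be reproduced exactly. The calls below are standard;
# the actual generation step is omitted.
import random

import numpy as np
import torch

SEED = 1234
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed_all(SEED)

# Many diffusion pipelines also accept an explicit generator, e.g.
# generator = torch.Generator("cpu").manual_seed(SEED)  # pass it in if supported
```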
Rule F: Optimize GPU Utilization. Aim for 100% GPU utilization within your cluster, but recognize that relying solely on GPUs can diminish returns. Underutilized GPUs waste money and energy. However, hotspots will always exist within the model blocks stored in memory and SSDs. Using GPUs to process infrequently accessed blocks can increase your system's token price.
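One way to watch for underused GPUs is NVML, the library behind nvidia-smi. The sketch below assumes the nvidia-ml-py package is installed and treats the 80% alert threshold as an arbitrary choice.

```python
# Poll per-GPU utilization via NVML and flag underused devices.
# Requires the nvidia-ml-py package; the 80% threshold is an assumption.
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu
        if util < 80:
            print(f"GPU {i}: {util}% utilized; consider consolidating workloads")
finally:
    pynvml.nvmlShutdown()
```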
Rule G: SSDs for Inference. Leverage SSD lookups for inference. From a hardware perspective, AI is essentially a grid of blocks specifying the next block to choose based on the context state. SSD blocks are integral to this, just like host and GPU DRAM. GPUs handle stream processing of the retrieved blocks. This is particularly useful for images and videos, allowing multiple GPUs to fetch and process different blocks of an image, generating avatars on the fly. The inference step can be extended with a rendering layer, converting physics or OpenUSD scenes into actual video frames.
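A sketch of the SSD-backed lookup, assuming model or content blocks of a fixed size stored in one large file; the file layout, float16 contents, and 4 MiB block size are illustrative, not a prescribed format.

```python
# Fetch one fixed-size block from a large file on SSD and hand it to the GPU.
# The single-file layout, float16 contents, and block size are assumptions.
import mmap

import numpy as np
import torch

BLOCK_BYTES = 4 * 1024 * 1024          # assumed block size: 4 MiB

def fetch_block(path: str, index: int) -> torch.Tensor:
    with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        raw = mm[index * BLOCK_BYTES:(index + 1) * BLOCK_BYTES]
    arr = np.frombuffer(raw, dtype=np.float16).copy()
    device = "cuda" if torch.cuda.is_available() else "cpu"
    return torch.from_numpy(arr).to(device)  # stream-process the block on the GPU
```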
Rule H: Vector Search. The block-based description of AI models suggests that we do not need to compare the context against the key of every single block in the system. Vector similarity algorithms, as used in RAG, can identify promising candidate blocks for the next step. Graph algorithms are just indexes over the blocks of state transitions in this model. The underlying hardware is always a grid of blocks, even when indexed, so we only search a constant number of them at a time for the right answer.
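A brute-force sketch of the idea, where cosine similarity over block keys stands in for a real vector index such as those used in RAG systems; the dimensions and data are made up.

```python
# Rank candidate blocks by cosine similarity between the context vector and
# each block's key vector. Brute force here; a vector index would narrow this.
import numpy as np

def top_k_blocks(context_vec: np.ndarray, block_keys: np.ndarray, k: int = 3):
    # normalize, then cosine similarity reduces to a dot product
    c = context_vec / np.linalg.norm(context_vec)
    b = block_keys / np.linalg.norm(block_keys, axis=1, keepdims=True)
    scores = b @ c
    return np.argsort(scores)[::-1][:k]       # indices of the best candidates

keys = np.random.rand(10_000, 384).astype(np.float32)   # 10k blocks, 384-dim keys
query = np.random.rand(384).astype(np.float32)
print(top_k_blocks(query, keys))
```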
Rule I: Incremental Training. This block-based, indexed storage suggests that the training step of placing the next block for new training data can be as simple as adding it to a partial index. The grid can be partitioned by profession for faster training, a method DeepSeek used in 2025 for an order-of-magnitude improvement in training costs.
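A sketch of such a partitioned, incremental index; the partition names, key vectors, and block ids are placeholders.

```python
# Incremental, partitioned indexing: new training material is appended to the
# index of its own partition ("profession") instead of retraining everything.
from collections import defaultdict

index: dict[str, list[dict]] = defaultdict(list)

def add_training_block(profession: str, key_vector: list[float], block_id: str) -> None:
    index[profession].append({"key": key_vector, "block": block_id})

add_training_block("radiology", [0.12, 0.88, 0.05], "ct-protocol-007")
add_training_block("accounting", [0.70, 0.10, 0.33], "ifrs-16-leases")
print({k: len(v) for k, v in index.items()})
```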
Rule J: Efficient Training. Training can be redefined as a more efficient algorithm. Split the training data by topic into fine-grained chunks, so that each training step is just a fine-tuning pass that layers a shallow slice of new information onto the current model. This chained training is O(n), linear in the size of the training set. Training on a 125-million-word corpus becomes roughly 250,000 fine-tuning steps, one page at a time. This also helps when training small language models: pick the right books to train on. Dependencies between topics even allow sub-models to be trained in parallel.
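A sketch of the chained training loop; fine_tune() is a hypothetical stand-in for one shallow update, not a real API.

```python
# Chained fine-tuning: split the corpus by topic into page-sized chunks and
# apply each as a small fine-tuning step on top of the previous checkpoint.
def fine_tune(model, page_text: str):
    # hypothetical placeholder: one shallow gradient update on a single page
    return model

def chained_training(model, pages: list[str]):
    for page in pages:                 # O(n) in the number of pages
        model = fine_tune(model, page)
    return model

# 125M words at roughly 500 words per page is about 250,000 fine-tuning steps
```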
Rule K: Differential Training. This approach of reranking and fine-tuning the model from the very beginning can make training more approachable than it is today. Start with a plain English grammar book, add mathematics, and so on, until the desired goal is reached, such as a simple vending machine. Training becomes easy, repeatable, auditable, and formally verifiable.
Rule L: Model Chaining. Models can even be chained under this model. If an answer is not specific, it can still contain clues for finding the right follow-up, just as RAG finds the right context today. Chaining models can also address the hybrid nature of MCP tooling.
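A sketch of clue-based chaining; the routing rule and the model callables are hypothetical stubs.

```python
# If the first answer is not specific, use the clues in it to pick a
# follow-up model. route() and the model callables are hypothetical stubs.
def route(answer: str) -> str | None:
    # crude clue-based routing; a vector index could replace this
    if "tax" in answer.lower():
        return "tax-specialist"
    return None

def answer_with_chain(question: str, models: dict) -> str:
    draft = models["generalist"](question)
    follow_up = route(draft)
    return models[follow_up](question) if follow_up else draft

models = {
    "generalist": lambda q: "This looks like a tax question.",
    "tax-specialist": lambda q: "A specific answer from the tax model.",
}
print(answer_with_chain("Can I deduct new GPUs?", models))
```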
Rule M: Grounding. The formally verifiable training data includes a set of questions that can be asked to regenerate that training data from the model. This provides verification steps, known as grounding, that counter hallucinations by having the model return proven answers.
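A sketch of a grounding check, assuming the model is a callable and the verification questions and expected answers come straight from the training data.

```python
# Replay verification questions whose expected answers come from the training
# data, and measure how many the model reproduces. The model callable and the
# question set are assumptions for illustration.
def grounding_score(model, verification_set: list[tuple[str, str]]) -> float:
    correct = sum(1 for question, expected in verification_set
                  if model(question).strip() == expected.strip())
    return correct / max(len(verification_set), 1)

questions = [("What is the boiling point of water at sea level?", "100 °C")]
print(grounding_score(lambda q: "100 °C", questions))   # 1.0: fully grounded
```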
Rule N: KV Storage. Once a model can be fine-tuned in many small steps, it can also be flipped around for grounding: the model becomes an index, much as a zip file with key-value lookups offers far more than a tree-shaped file system hierarchy. As a result, model chaining does not even need raw training data, only a flippable, trained sub-model to fine-tune with.
Link of the day.
Learn more about Phison's AIDAPTIV solution here.