The 10,000-word report that shook Wall Street: behind the plunge of Bitcoin and Nvidia

A professional investor who has worked as both an analyst and a software engineer wrote a bearish article on Nvidia. After being retweeted by major accounts on Twitter, it was blamed as one of the main "culprits" behind the plunge in Nvidia's stock: nearly $600 billion of market value evaporated, the largest single-day loss for any listed company to date.

The investor, Jeffrey Emanuel, argues that DeepSeek has exposed the bluster of Wall Street, the big tech companies and Nvidia itself, and that the stock is overrated. "Every investment bank recommends buying Nvidia, like the blind leading the blind, with no idea what they are talking about."

Jeffrey Emanuel says Nvidia faces a much rougher road to maintaining its current growth trajectory and profit margins than its valuation suggests. The company can be attacked from five different directions: architectural innovation, vertical integration by its customers, software abstraction, efficiency breakthroughs, and the democratization of manufacturing. The odds that at least one of these succeeds well enough to materially dent Nvidia's margins or growth rate look very good, yet judging by the current valuation, the market has priced in none of these risks.

According to some industry investors, the report turned Emanuel into an overnight Wall Street celebrity, with many hedge funds paying him $1,000 an hour to hear his views on Nvidia and AI. He has reportedly been talking himself hoarse, while happily counting the money.

The following is the full text of the report, provided for reference.

Having spent about ten years as an investment analyst at various long/short hedge funds (including stints at Millennium and Balyasny), and being a math and computer nerd who has been studying deep learning since 2010 (back when Geoff Hinton was still talking about restricted Boltzmann machines, everything was still programmed in MATLAB, and researchers were still trying to prove they could beat support vector machines at classifying handwritten digits), I think I have a rather unique perspective on how artificial intelligence technology is developing and how that relates to equity valuations in the stock market.

Over the past few years I have worked more as a developer, and I maintain several popular open source projects for working with various kinds of AI models and services (for example LLM Aided OCR, Swiss Army Llama, Fast Vector Similarity, Source to Prompt and Pastel Inference Layer). Basically, I use these cutting-edge models intensively every day. I keep three Claude accounts so I never run out of requests, and I signed up for ChatGPT Pro within minutes of it going live.

I also try to stay on top of the latest research and read carefully all the important technical reports released by the major AI labs. So I think I have a pretty good grasp of the field and of how things are developing. At the same time, I have shorted a lot of stocks in my life and have twice won the Best Idea award from the Value Investors Club (TMS long and PDH short, if you've been following along).

I say this not to brag, but to establish that I can speak to this issue without either technologists or professional investors dismissing me as hopelessly naive. There are certainly many people who know the math and science better than I do, and many who are better at long/short equity investing, but I think very few sit in the middle of that Venn diagram the way I do.

Nevertheless, whenever I meet up with friends and former colleagues from the hedge fund world, the conversation quickly turns to Nvidia. A company going from obscurity to a market value larger than the entire stock market of the UK, France or Germany is not something you see every day! Naturally, these friends want to know what I think. Because I firmly believe in the long-term transformative impact of this technology (I really do think it will change nearly every aspect of our economy and society over the next 5 to 10 years, which is basically unprecedented), it is hard for me to argue that Nvidia's momentum will slow or stop any time soon.

But even though I have felt for the past year or so that the valuation was simply too rich for my taste, a series of recent developments has tipped me toward my usual instinct: to be more contrarian about the outlook and to question the consensus when it seems fully priced in. The saying "what the wise man does in the beginning, the fool does in the end" is famous for a reason.

The bull case

Before we get into the developments that gave me pause, let's briefly review the bull case for NVDA stock, which by now practically everyone knows. Deep learning and artificial intelligence are the most transformative technologies since the internet and are poised to fundamentally change everything in our society. And when it comes to the share of the industry's total capital expenditure that goes to training and inference infrastructure, Nvidia is in something close to a monopoly position.

Some of the world's largest and most profitable companies, such as Microsoft, Apple, Amazon, Meta, Google and Oracle, have decided they must stay competitive in this field at all costs, because they simply cannot afford the consequences of being left behind. Capital expenditure, electricity consumption, square footage of new data centers and, of course, GPU counts have all exploded, with no sign of slowing down. And Nvidia earns astonishing gross margins of up to roughly 90% on its high-end data center products.

And that only scratches the surface of the bull case. There are further strands to it now that would make even the already very optimistic more optimistic. Beyond the rise of humanoid robots (which I suspect will surprise most people once they can quickly handle tasks that currently require unskilled or even skilled workers, such as doing laundry, cleaning, tidying and cooking; doing construction work in crews, like remodeling a bathroom or building a house; or running a warehouse and driving forklifts), there are other factors that most people have not even considered.

One major theme that smart people are talking about is the rise of a "new scaling law", which offers a new paradigm for thinking about how compute demand will grow over time. The original scaling law, the one that has driven progress in AI since AlexNet appeared in 2012 and the Transformer architecture was invented in 2017, is the pre-training scaling law: the more tokens we use as training data (now in the trillions), the more parameters in the models we train, and the more compute (FLOPS) we spend training those models on those tokens, the better the final models perform on a wide variety of very useful downstream tasks.
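To make the shape of that relationship concrete, published scaling-law fits (for example the Chinchilla work) express loss as a power law in parameter count and token count. The sketch below only illustrates the functional form; the coefficient values are placeholders, not figures from this report.

```python
# Illustrative pre-training scaling law of the Chinchilla form:
#   L(N, D) = E + A / N**alpha + B / D**beta
# where N = parameter count, D = training tokens, L = loss (lower is better).
# The constants below are placeholders for illustration, not fitted values.

def predicted_loss(n_params: float, n_tokens: float,
                   E: float = 1.7, A: float = 400.0, B: float = 1400.0,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Both more parameters and more training tokens push the loss down, predictably."""
    return E + A / n_params**alpha + B / n_tokens**beta

# Scaling up both parameters and tokens gives a predictable improvement:
print(predicted_loss(70e9, 1.4e12))   # hypothetical 70B model trained on 1.4T tokens
print(predicted_loss(140e9, 2.8e12))  # hypothetical 140B model trained on 2.8T tokens
```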

What's more, this improvement is predictable enough that leading AI labs like OpenAI and Anthropic have a very good idea of how good their latest model will be before they even begin the actual training run; in some cases they can predict the final model's benchmark scores to within a few percentage points. This "original scaling law" is very important, but it has always left some nagging doubts in the minds of people using it to forecast the future.

For one thing, we appear to have largely exhausted the world's accumulated stock of high-quality training data. That is not entirely true, of course: there are still many old books and journals that have not been properly digitized, and even where they have been digitized, they are not properly licensed for use as training data. The problem is that even if you credit all of it (say, the sum total of "professionally" produced written English from 1500 to 2000), it is not a huge amount in percentage terms once you are talking about a training corpus of nearly 15 trillion tokens, which is the scale of today's frontier models.

To quickly sanity-check those numbers: Google Books has digitized about 40 million books so far. If a typical book contains 50,000 to 100,000 words, or roughly 65,000 to 130,000 tokens, then books alone account for around 2.6T to 5.2T tokens, a large share of which is already included in the training corpora used by the big labs, whether or not that is strictly legal. There are also plenty of academic papers, with more than 2 million on the arXiv site alone, and the Library of Congress holds over 3 billion pages of digitized newspapers. Added together, the total might reach about 7T tokens, but since most of it is in fact already in training corpora, the remaining "incremental" training data probably doesn't matter much in the grand scheme of things.
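As a quick reproduction of that back-of-envelope arithmetic (the per-book word counts and the tokens-per-word ratio are rough assumptions, per the text above):

```python
# Back-of-envelope token counts for digitized books, using the rough assumptions above.
BOOKS_DIGITIZED = 40_000_000          # approximate Google Books count cited in the text
WORDS_PER_BOOK = (50_000, 100_000)    # assumed range for a typical book
TOKENS_PER_WORD = 1.3                 # common rough ratio for English text

low = BOOKS_DIGITIZED * WORDS_PER_BOOK[0] * TOKENS_PER_WORD
high = BOOKS_DIGITIZED * WORDS_PER_BOOK[1] * TOKENS_PER_WORD
print(f"~{low/1e12:.1f}T to {high/1e12:.1f}T tokens from books alone")  # ~2.6T to 5.2T
```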

There are, of course, other ways to gather more training data. You could, for example, automatically transcribe every YouTube video and use the text. While that might help at the margin, it is certainly of much lower quality than a well-regarded organic chemistry textbook, which is a genuinely useful source of knowledge about the world. So from the standpoint of the original scaling law we have always faced the looming threat of a "data wall": we know we can keep pouring more capex into GPUs and building more data centers, but mass-producing useful new human knowledge that correctly complements what already exists is much harder. One interesting response to this has been the rise of "synthetic data", text that is itself the output of an LLM. Although that might sound a bit absurd, improving a model by feeding it its own output does work very well in practice, at least in domains like mathematics, logic and computer programming.

The reason, of course, is that in these domains we can mechanically check and prove correctness. So we can sample enormous numbers of candidate mathematical proofs or Python scripts, actually verify whether they are correct, and include only the correct ones in our dataset. In this way we can greatly expand our collection of high-quality training data, at least in these areas.
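A minimal sketch of that filter-by-verification idea for code, assuming a hypothetical `generate_candidates()` wrapper around whatever model you are sampling from (the function name and the toy task are made up for illustration):

```python
import subprocess, sys, tempfile, textwrap

def passes_check(candidate_code: str, test_code: str) -> bool:
    """Run a candidate solution plus its unit tests in a subprocess; keep it only if they pass."""
    program = candidate_code + "\n\n" + test_code
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(program)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, timeout=10)
    return result.returncode == 0

test_code = textwrap.dedent("""
    assert add(2, 3) == 5
    assert add(-1, 1) == 0
""")

# generate_candidates() is a hypothetical wrapper around an LLM sampling loop.
verified = [c for c in generate_candidates("write a function add(a, b)")
            if passes_check(c, test_code)]
# Only the verified candidates get added to the synthetic training set.
```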

Besides text, there are other kinds of data we could train AI on. For example, what if we trained on complete genome sequencing data for 100 million people (a single person's uncompressed genome is roughly 200GB to 300GB)? That is obviously a lot of data, even though the vast majority of it is nearly identical between any two people. Comparisons with text data from books and the internet can be misleading, though, for several reasons:

Raw genome size is not directly comparable to token counts

The information content of genomic data is very different from that of text

The training value of highly redundant data is not yet clear

The compute requirements for processing genomic data are also different

But it is still another enormous source of information we could train on in the future, which is why I mention it.

So while more incremental training data is surely coming, a glance at how fast training corpora have been growing in recent years makes clear that we will soon hit a bottleneck in the availability of "generally useful" knowledge data, the kind that brings us closer to the ultimate goal of an artificial superintelligence ten times smarter than John von Neumann and a world-class expert in every professional field known to humanity.

Beyond the limited supply of data, proponents of the pre-training scaling law have always had a few other worries lurking in the back of their minds. One of them is: what do you do with all that compute infrastructure after you finish training a model? Train the next model? Sure, but given how quickly GPU speed and capacity are improving, and how much electricity and other operating costs matter to the economics of compute, does it really make sense to use a two-year-old cluster to train a new model? Naturally you would rather use the brand-new data center you just built, which cost ten times as much and, thanks to more advanced technology, delivers twenty times the performance of the old one. The problem is that at some point you do need to amortize the upfront cost of these investments and recoup it through a (hopefully positive) stream of operating profit, right?

The market has been so excited about AI that it has ignored this, allowing companies like OpenAI to rack up operating losses from day one while commanding ever-higher valuations in successive funding rounds (to be fair, they have also shown very rapidly growing revenue). But ultimately, for this to hold up across a full market cycle, those data center costs eventually have to be recovered, ideally with a profit, so that after some reasonable period the investment can compete with other opportunities on a risk-adjusted basis.

New paradigm

OK, so that is the pre-training scaling law. What is this "new" scaling law? It is something people have only really started to focus on in the past year: inference-time compute scaling. Until now, the vast majority of the compute spent in the whole process went into the upfront training compute used to create the model in the first place. Once you had a trained model, running inference on it (asking a question, or having the LLM perform some task for you) consumed only a limited amount of compute.

Crucially, the total amount of inference compute (measured in various ways, such as FLOPS or GPU memory footprint) was far lower than what the pre-training phase required. Inference compute does rise as you increase the model's context window size and the amount of output generated in one go (though researchers have made astonishing algorithmic improvements here, relative to the quadratic scaling people originally expected). But essentially, until recently, inference compute was generally much less intensive than training compute and scaled roughly linearly with the number of requests handled; the more ChatGPT text completions people asked for, for example, the more inference compute was consumed.

With the arrival of the revolutionary chain-of-thought (COT) models introduced over the past year, most notably OpenAI's flagship O1 model (and very recently also DeepSeek's new R1 model, which uses the same technique and which we will discuss in detail later), all of that changed. Instead of the amount of inference compute being directly proportional to the length of the output text the model generates (scaling up with larger context windows, model sizes and so on), these new COT models also generate intermediate "logic tokens"; think of them as a kind of scratchpad or "internal monologue" the model uses while it tries to solve your problem or complete its assigned task.

This represents a sea change in how inference compute works: now, the more tokens you spend on this internal thinking process, the better the quality of the final output the user receives. In effect, it is like giving a human worker more time and resources to complete a task, so they can double-check their work repeatedly, approach the same basic task in several different ways and verify that the results match, plug the result back into the formula to confirm it actually solves the equation, and so on.
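One simple way to see how "spending more inference tokens" trades off against accuracy is a self-consistency loop: sample several independent reasoning chains and keep the majority answer. This is only a generic sketch of the idea, not how O1 or R1 work internally, and `sample_chain_of_thought()` is a hypothetical stand-in for a call to whatever reasoning model you use.

```python
from collections import Counter

def answer_with_self_consistency(question: str, n_chains: int = 8) -> str:
    """Spend more inference compute (more sampled reasoning chains) to raise accuracy.

    sample_chain_of_thought() is a hypothetical helper that returns
    (reasoning_text, final_answer) from one stochastic model sample.
    """
    finals = []
    for _ in range(n_chains):
        _reasoning, final_answer = sample_chain_of_thought(question)
        finals.append(final_answer)
    # Majority vote over the final answers; more chains -> more compute -> usually better.
    return Counter(finals).most_common(1)[0][0]
```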

As it turns out, the effect of this approach is almost miraculous; it harnesses the long-anticipated power of reinforcement learning together with the power of the Transformer architecture, and it directly addresses one of the Transformer model's greatest weaknesses: its tendency to hallucinate.

Basically, because of how Transformers predict the next token at each step, if they start down a wrong "path" early in a response, they become almost like an evasive child making up a story to explain why they were actually right all along, even when common sense should have told them partway through that what they were saying could not possibly be correct.

Because the models always try to stay internally consistent and make each newly generated token follow naturally from the preceding tokens and context, it is very hard for them to course-correct or backtrack. By breaking the reasoning process into many intermediate stages, they can instead try lots of different approaches, see which ones hold up, and keep course-correcting and trying other tacks until they reach a reasonably high level of confidence that they are not talking nonsense.

What is most remarkable about this approach, beyond the fact that it works at all, is that the more logic/COT tokens you spend, the better it works. Suddenly there is an extra dial to turn: as the number of COT inference tokens increases (which requires more inference compute, in both floating-point operations and memory), the higher the probability that the answer is correct; that the code runs without errors on the first try, or that the solution to a logic problem has no obviously flawed step in its reasoning.

I can tell you from a great deal of first-hand experience that, as good as Anthropic's Claude 3.5 Sonnet model is at Python programming (and it is very good), it will invariably make one or more silly mistakes whenever you ask it to generate anything long and complicated. These mistakes are usually easy to fix; in fact, it is often enough to simply paste back the error produced by the Python interpreter as a follow-up prompt (or, more usefully, the full set of "problems" your code editor flags via so-called linters), without any additional explanation. When the code gets very long or very complex, though, it can take longer to fix and may even require some manual debugging.
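A bare-bones sketch of that feedback loop (run the generated code, paste the traceback back in, ask again) might look like the following; `ask_model()` is a hypothetical wrapper for whichever chat API you use, and the loop is deliberately simplified.

```python
import subprocess, sys, tempfile

def run_snippet(code: str) -> str:
    """Execute a code snippet and return its stderr (empty string means it ran cleanly)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run([sys.executable, path], capture_output=True, text=True, timeout=30)
    return result.stderr

def generate_until_it_runs(task: str, max_rounds: int = 4) -> str:
    # ask_model() is a hypothetical call to an LLM chat endpoint.
    code = ask_model(f"Write Python code for this task:\n{task}")
    for _ in range(max_rounds):
        error = run_snippet(code)
        if not error:
            return code  # ran without raising; good enough for this sketch
        # Feed the interpreter's own error message back as the next prompt.
        code = ask_model(f"This code failed with:\n{error}\nPlease fix it:\n{code}")
    return code
```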

When I first tried OpenAI's O1 model, it was a revelation: I was amazed at how often the code was simply correct the first time. That is because the COT process automatically finds and fixes problems before the model ever emits the final answer tokens in its response.

In fact, the O1 model available in OpenAI's $20-per-month ChatGPT Plus subscription is basically the same model as the O1-Pro in the new ChatGPT Pro subscription (which costs ten times as much, $200 per month, a price that caused quite an uproar in the developer community); the main difference is that O1-Pro thinks for much longer before responding, generating far more COT logic tokens and consuming a great deal more inference compute per response.

That is striking, because even for Claude 3.5 Sonnet or GPT-4o, even with roughly 400kb or more of context, a very long and complex prompt typically starts producing a response in under 10 seconds, and often under 5. The same prompt to O1-Pro can easily take more than 5 minutes to get an answer (although OpenAI does show you some of the "reasoning steps" generated while you wait; crucially, OpenAI has decided, presumably for trade-secret reasons, to hide the exact reasoning tokens from you and show only a heavily abridged summary).

As you can imagine, there are plenty of situations where accuracy is paramount: where you would rather give up and tell the user you simply cannot do it than give an answer that could easily be proven wrong, or one that involves hallucinated facts or other plausible-sounding but unsound reasoning. Anything involving money or transactions, medicine, or the law, to name just a few.

Basically, wherever the cost of inference is trivial relative to the fully loaded hourly compensation of the human knowledge worker interacting with the AI system, dialing up the COT compute becomes a complete no-brainer (the main drawback is that it greatly increases response latency, so in some cases you may prefer to iterate faster with lower-latency, lower-accuracy responses).

A few weeks ago there was exciting news in the AI world concerning OpenAI's as-yet-unreleased O3 model, which can solve a class of problems previously thought to be out of reach of existing AI approaches for the near future. OpenAI can crack these hardest problems (including exceptionally difficult "foundational" math problems that even highly skilled professional mathematicians struggle with) because it is willing to throw enormous computing resources at them; in some cases spending more than $3,000 worth of compute on a single task (by comparison, with a conventional Transformer model and no chain of thought, the traditional inference cost for a single task would be unlikely to exceed a few dollars).

You don't need to be an AI genius to see that this development creates a new scaling law entirely distinct from the original pre-training scaling law. You still want to train the best model you can by cleverly marshaling as much compute and as many trillions of high-quality training tokens as possible, but that is only the beginning of the story in this new world; now you can also pour staggering amounts of compute into inference from those models, either to reach a very high level of confidence, or to tackle extraordinarily hard problems that demand "genius-level" reasoning and avoid all the potential pitfalls that would lead an ordinary LLM astray.

But why should Nvidia capture all of the upside?

Even if you believe, as I do, that the future of AI is almost unimaginably bright, the question remains: "Why should one company capture the majority of the profits from this technology?" History is full of hugely important new technologies that changed the world, yet the main winners were not the companies that looked most promising in the early stages. The Wright brothers' aircraft company invented and perfected the technology, yet even after morphing into multiple companies it is worth less than $10 billion today. And although Ford has a respectable market value of $40 billion, that is only 1.1% of Nvidia's current market value.

To understand this, you have to understand why Nvidia has captured so much market share in the first place. After all, they are not the only company making GPUs. AMD makes perfectly respectable GPUs that, on paper, have comparable transistor counts, process nodes and so on. Sure, AMD GPUs are not as fast or as advanced as Nvidia's, but it is not as though Nvidia's are 10x faster or anything like that. In fact, on a naive cost-per-FLOP basis, AMD GPUs cost roughly half as much as Nvidia GPUs.

Looking at other semiconductor markets, such as DRAM: although that market is highly concentrated, with only three players of any consequence worldwide (Samsung, Micron, SK Hynix), gross margins in DRAM run negative at the bottom of the cycle, around 60% at the top, and roughly 20% on average. By contrast, Nvidia's overall gross margin in recent quarters has been about 75%, and that figure is dragged down by its lower-margin, more commoditized consumer 3D graphics business.

So how is this possible? Well, the main reasons have to do with software: drivers that "just work" on Linux and are battle-tested and highly reliable (unlike AMD, whose Linux drivers are notorious for poor quality and instability), and highly optimized open source code in popular libraries such as PyTorch that has been tuned to run extremely well on Nvidia GPUs.

Beyond that, CUDA, the programming framework developers use to write low-level code optimized for GPUs, is wholly owned by Nvidia and has become the de facto standard. If you want to hire a bunch of extremely talented programmers who know how to use GPUs to accelerate things, and you are willing to pay them the $650,000 a year or whatever the going rate is for people with that particular skill set, odds are they "think" and work in CUDA.

Besides the software advantage, Nvidia's other major edge is the so-called interconnect: in essence, the bandwidth that links thousands of GPUs together efficiently so they can be harnessed jointly to train today's frontier foundation models. In short, the key to efficient training is keeping every GPU as fully utilized as possible at all times, not sitting idle waiting for the next batch of data it needs for the next step of training.

The bandwidth requirements are extremely high, far beyond the typical bandwidth needed for traditional data center applications. This interconnect cannot use conventional networking gear or fiber, because those introduce too much latency and cannot deliver the terabytes per second of bandwidth needed to keep all the GPUs constantly busy.

Nvidia's acquisition of the Israeli company Mellanox for $6.9 billion in 2019 was an extremely shrewd move, and it is this acquisition that gave them their industry-leading interconnect technology. Note that interconnect speed matters far more for training (where the outputs of thousands of GPUs must be harnessed simultaneously) than for inference, including COT inference, which needs only a handful of GPUs: all you need is enough VRAM to hold the quantized (compressed) weights of the already-trained model.

These, arguably, are the main components of Nvidia's "moat" and the reason it has been able to sustain such high margins for so long (there is also a "flywheel" effect, whereby they aggressively reinvest their outsized profits into enormous amounts of R&D, which in turn helps them improve their technology faster than the competition, so they stay ahead on raw performance).

But as noted earlier, what customers ultimately care about, all else being equal, is performance per dollar (both the upfront capex cost of the hardware and its energy use, i.e. performance per watt), and although Nvidia's GPUs are certainly the fastest, they are not the most cost-effective when measured purely in FLOPS.

But the problem is that all else is not equal: AMD's drivers are terrible, the popular AI software libraries don't run well on AMD GPUs, outside of gaming you can't find GPU experts who really know AMD hardware (why would they bother, when the market demand is for CUDA experts?), and you can't effectively wire thousands of AMD GPUs together because their interconnect technology is so much weaker. All of this means AMD is essentially uncompetitive in the high-end data center space and does not appear to have good near-term prospects there.

OK, so it sounds like Nvidia's prospects are fantastic, right? Now you see why the stock carries such a valuation! But are there any clouds on the horizon? Well, I think there are a few that deserve serious attention. Some of them have been lurking in the background for the past few years; given the pace of growth their impact has been minimal so far, but they look poised to build. Others have emerged only very recently (as in, the last two weeks) and could significantly alter the near-term trajectory of GPU demand growth.

Major Threats

At a macro level, you can think of it this way: Nvidia operated in a fairly niche field for a very long time; it had very few competitors, and those competitors were neither profitable enough nor growing fast enough to pose a real threat, because they never had the capital to seriously pressure a market leader like Nvidia. The gaming market was large and growing, but it never produced jaw-dropping margins or particularly spectacular annual growth.

Around 2016-2017, a few of the big tech companies started to ramp up hiring and spending on machine learning and AI, but for all of them it was never a truly mission-critical project; it was more of a "moonshot" R&D budget line. The real AI arms race began with the release of ChatGPT in 2022, and although that is only a bit more than two years ago, it feels like an eternity given how fast things have been moving.

Suddenly the big companies were ready to invest billions at astonishing speed. The number of researchers attending large conferences such as NeurIPS and ICML surged. Smart students who might previously have gone into financial derivatives switched to Transformers instead, and compensation packages above a million dollars for non-executive engineering roles (i.e. individual contributors who don't manage a team) became the norm at leading AI labs.

Changing the course of a big cruise ship takes a while; even if you move very fast and spend billions of dollars, it takes a year or more to build a brand-new data center, order all the equipment (with ever-lengthening lead times), and get everything set up and debugged. Even the smartest programmers need a long time to really hit their stride and become familiar with the existing codebases and infrastructure.

But you can imagine that the money, manpower and energy now being poured into this area are absolutely astronomical. And Nvidia is the biggest target of all, because it is the one capturing the lion's share of the profits today, not in some hypothetical future where AI runs our whole lives.

So the single most important takeaway is that "the market always finds a way": competitors will find alternative, radically innovative approaches to building hardware, using brand-new ideas to get around the barriers that protect Nvidia's moat.

Hardware-level threats

Take, for example, Cerebras' so-called wafer-scale AI training chips, which dedicate an entire 300mm silicon wafer to a single, absolutely enormous chip containing orders of magnitude more transistors and cores on one die (see their recent blog post explaining how they solved the yield problems that had kept this approach from being economically practical).

To put this in perspective, compare Cerebras' latest WSE-3 chip with Nvidia's flagship data center GPU, the H100: the Cerebras chip has a total die area of 46,225 square millimeters versus just 814 for the H100 (which is itself a huge chip by industry standards); that is a factor of 57! And instead of the 132 "streaming multiprocessor" cores enabled on an H100, the Cerebras chip has roughly 900,000 cores (each core is smaller and does less, of course, but it is still an almost unimaginably large number). In concrete AI terms, the Cerebras chip delivers roughly 32 times the FLOPS of a single H100. Since an H100 sells for close to $40,000, you can imagine the WSE-3 is not cheap either.

So what is the point of all this? Rather than trying to fight Nvidia head-on with a similar approach, or trying to match Mellanox's interconnect technology, Cerebras has taken a radically different route that sidesteps the interconnect problem altogether: when everything runs on one giant chip, the bandwidth problem between processors largely goes away. You don't even need the same class of interconnect, because a single monster chip replaces tons of H100s.

Moreover, Cerebras chips also perform extremely well on AI inference tasks. In fact, you can try it for free today, using Meta's very well-known Llama-3.3-70B model. It responds essentially instantly, at around 1,500 tokens per second. For perspective, anything over 30 tokens per second feels relatively snappy to users, judging by ChatGPT and Claude, and even 10 tokens per second is fast enough that you can basically read the response as it is being generated.

Cerebras isn't the only one; there are others, such as Groq (not to be confused with the Grok model family trained by Elon Musk's xAI). Groq has taken another innovative approach to the same fundamental problem. Rather than trying to compete directly with Nvidia's CUDA software stack, they have developed what they call a "tensor processing unit" (TPU) purpose-built for the exact mathematical operations that deep learning models need. Their chips are designed around the concept of "deterministic compute", meaning that, unlike traditional GPUs, they execute operations in a completely predictable way every single time.

That may sound like a minor technical detail, but it has enormous implications for both chip design and software development. Because timing is completely deterministic, Groq can optimize its chips in ways that are impossible with traditional GPU architectures. As a result, over the past six months they have been demonstrating inference speeds of over 500 tokens per second on the Llama family and other open source models, far beyond what conventional GPU setups can achieve. Like Cerebras, this is available today and you can try it for free.

Using the Llama3 model with "speculative decoding", Groq can generate 1,320 tokens per second, comparable to Cerebras and far beyond what ordinary GPUs manage. Now, you might ask what the point of exceeding 1,000 tokens per second is when users seem quite happy with ChatGPT, which runs well below that. The answer is that it matters a great deal. Instant feedback lets you iterate far faster without losing focus, the way human knowledge workers otherwise do. And if you are using the model programmatically through an API, it enables entirely new classes of applications that require multi-stage inference (where the output of one stage feeds into the prompt for the next) or low-latency responses, such as content moderation, fraud detection, dynamic pricing and so on.
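As an illustration of what a multi-stage, latency-sensitive pipeline looks like (a rough sketch only: `call_model()` is a hypothetical wrapper around whichever low-latency inference API you are using, and the stages are invented for the example):

```python
def call_model(prompt: str) -> str:
    # Hypothetical wrapper around a low-latency inference API (Groq, Cerebras, etc.).
    raise NotImplementedError

def moderate_listing(raw_listing: str) -> dict:
    """Three chained inference calls; end-to-end latency is the sum of all stages,
    which is why per-stage token throughput matters so much."""
    summary = call_model(f"Summarize this marketplace listing in two sentences:\n{raw_listing}")
    risk = call_model(f"Does this summary suggest fraud or a policy violation? Answer YES or NO, then why:\n{summary}")
    if risk.strip().upper().startswith("YES"):
        action = call_model(f"Draft a short takedown notice for this listing:\n{summary}")
    else:
        action = "approve"
    return {"summary": summary, "risk": risk, "action": action}
```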

More fundamentally, the faster requests are served, the faster everything can cycle and the busier the hardware stays. Groq's hardware is very expensive, with a single server running $2 million to $3 million, but if demand is high enough to keep the hardware busy around the clock, the cost per request falls dramatically.

As with Nvidia and CUDA, a large part of Groq's advantage comes from its own proprietary software stack. They take the open source models that companies such as Meta, DeepSeek and Mistral develop and release for free, and decompose them in specialized ways so that they run dramatically faster on Groq's particular hardware.

Like Cerebras, they have made different technical bets to optimize certain parts of the process, which lets them do things in a fundamentally different way. In Groq's case, the focus is entirely on inference-time compute rather than training: all of their specialized hardware and software delivers its huge speed and efficiency advantages only when running inference on models that are already trained.

But if the next big scaling law everyone expects is inference-time compute, and the biggest drawback of COT models is the high latency from having to generate all those intermediate logic tokens before responding, then even a company that does nothing but inference poses a serious competitive threat over the next few years, as long as its speed and efficiency far exceed Nvidia's. At the very least, Cerebras and Groq can chip away at the lofty revenue growth expectations baked into Nvidia's current valuation.

Beyond these particularly innovative but still relatively obscure startup competitors, serious competition is also coming from some of Nvidia's biggest customers themselves, who have been building custom silicon aimed specifically at AI training and inference workloads. The best known is Google, which has been developing its own proprietary TPUs since 2016. Interestingly, although Google briefly sold TPUs to external customers, it has used essentially all of them internally for the past several years, and it is already on its sixth generation of TPU hardware.

Amazon is also building its own custom chips, called Trainium2 and Inferentia2. While Amazon is building out data centers stocked with billions of dollars of Nvidia GPUs, it is simultaneously investing billions in other data centers that use these in-house chips. It has one cluster coming online for Anthropic with over 400,000 of them.

Amazon has been criticized for badly fumbling its internal AI model development, squandering huge amounts of internal compute on models that ended up uncompetitive, but custom silicon is a different matter. Again, their chips don't need to be better or faster than Nvidia's. They just need to be good enough, and manufactured at a break-even gross margin rather than the roughly 90% gross margin Nvidia earns on its H100 business.

OpenAI has also announced plans to build custom chips, and together with Microsoft it is obviously the single largest user of Nvidia's data center hardware. And as if that were not enough, Microsoft has announced its own custom chips as well!

Apple, the world's most valuable technology company, has been defying expectations for years with its wildly innovative and disruptive custom silicon operation, whose CPUs now comprehensively beat Intel's and AMD's on performance per watt, the metric that matters most for mobile (phone/tablet/laptop) applications. For years they have also been producing their own in-house designed GPUs and "neural processors", even though they have not yet really demonstrated the usefulness of those chips outside their own bespoke applications, such as the advanced software-based image processing in the iPhone camera.

While Apple's focus seems different from the other players, centered on mobile-first, consumer-oriented "edge" computing, if Apple ends up spending enough money on its new deal with OpenAI to provide AI services to iPhone users, you have to assume it has teams looking into building its own custom silicon for inference and training (although, given their secrecy, you may never hear about it directly!).

Now, it is no secret that Nvidia's hyperscaler customer base exhibits a strong power-law distribution, with a handful of top customers accounting for the vast majority of its high-margin revenue. How should we think about the future of this business when every single one of those VIP customers is building its own custom chips specifically for AI training and inference?

When weighing all this, keep one crucial fact in mind: Nvidia is, to a large degree, an intellectual property company. They don't manufacture their own chips. The truly scarce ingredients for making these incredible devices arguably come more from TSMC, which fabricates them, and ASML, which makes the specialized EUV lithography machines used for these leading-edge process nodes. That matters, because TSMC will sell its most advanced chips to any customer willing to commit enough upfront investment and guarantee a certain volume. It makes no difference to them whether those chips are Bitcoin mining ASICs, GPUs, TPUs, mobile phone SoCs, or anything else.

Whatever Nvidia's senior chip designers earn in a year, these tech giants can certainly offer enough cash and stock to poach some of the best of them. And once they have the team and the resources, they can design innovative chips within 2 to 3 years (chips that are perhaps not even half as advanced as the H100, but with Nvidia's gross margins there is an awful lot of room to work with), and thanks to TSMC they can turn those designs into actual silicon using exactly the same process node technology Nvidia uses.

Software Threat

As if these looming hardware threats were not bad enough, a few developments on the software side have also been gathering momentum over the past few years; they started slowly but are now gaining real force, and could pose a serious threat to the dominance of Nvidia's CUDA software. The first concerns the terrible Linux drivers for AMD GPUs. Remember how we discussed the way AMD has inexplicably allowed these drivers to remain awful for years, leaving a mountain of money on the table?

Interestingly, the infamous hacker George Hotz (famous for jailbreaking the original iPhone as a teenager, and currently CEO of the self-driving startup Comma.ai and the AI computer company Tiny Corp, which also develops the open source tinygrad AI software framework) recently announced that he was fed up with dealing with AMD's terrible drivers and was determined to be able to use the much cheaper AMD GPUs in his TinyBox AI computers (which come in several variants, some using Nvidia GPUs and some using AMD GPUs).

In fact, he has been building his own custom drivers and software stack for AMD GPUs without any help from AMD; on January 15, 2025, he tweeted via the company's X account: "We are one step away from a completely sovereign stack on AMD, the RDNA3 assembler. We have our own driver, runtime, libraries, and emulator. (All in around 12,000 lines!)" Given his track record and skills, it is likely they will finish the remaining work within the next few months, which would open up a lot of exciting possibilities for using AMD GPUs in all sorts of applications where companies currently have no choice but to pay up for Nvidia.

And that is just one set of drivers for AMD, and it is not even finished yet. What else is there? Well, there are other developments on the software side with far broader implications. For one, many large technology companies and the open source community are now collaborating on more general AI software frameworks, for which CUDA is merely one of many "compilation targets".

That is, you write your software using higher-level abstractions, and the system itself automatically converts those high-level constructs into extremely well-optimized low-level code that runs great on CUDA. But because the work is done at that higher level of abstraction, it can just as easily be compiled into low-level code that runs well on many other GPUs and TPUs from a variety of vendors, including the flood of custom chips under development at the big tech companies.

The best-known examples of these frameworks are MLX (sponsored primarily by Apple), Triton (sponsored primarily by OpenAI) and JAX (developed primarily by Google). MLX is particularly interesting because it provides a PyTorch-like API that runs efficiently on Apple Silicon, showing how these abstraction layers let AI workloads run on completely different architectures. Triton, meanwhile, is becoming increasingly popular because it lets developers write high-performance code that can be compiled to run on a variety of hardware targets without needing to understand the low-level details of each platform.
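A tiny JAX example illustrates the point: the same high-level Python code is traced and JIT-compiled by XLA for whatever backend happens to be available (CPU, Nvidia GPU, or TPU), with no device-specific code from the author. This is just a generic illustration of the abstraction-layer idea, not anything specific to this report.

```python
import jax
import jax.numpy as jnp

@jax.jit  # XLA compiles this for whichever backend is present: CPU, GPU, or TPU.
def attention_scores(q, k):
    """Scaled dot-product attention scores, written once at a high level."""
    return jax.nn.softmax(q @ k.T / jnp.sqrt(q.shape[-1]), axis=-1)

q = jnp.ones((4, 64))
k = jnp.ones((8, 64))
print(attention_scores(q, k).shape)   # (4, 8)
print(jax.devices())                  # shows which hardware the code was compiled for
```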

These frameworks let developers write code once, with powerful abstractions, and then compile it automatically for a huge number of hardware targets; doesn't that sound like a more productive way to work? And it gives you far more flexibility in where the code actually runs.

In the 1980s, all of the most popular, best-selling software was written in hand-tuned assembly language. The PKZIP compression utility, for example, was hand-crafted to maximize speed, to the point where the same code written in standard C and compiled with the best optimizing compilers of the day might run at only half the speed of the hand-tuned assembly. The same was true of other popular packages like WordStar, VisiCalc and so on.

Over time, compilers got better and better, and every time the CPU architecture changed (say, when Intel introduced the 486, then the Pentium, and so on), the hand-written assembly usually had to be thrown out and rewritten, a job only the smartest programmers could do (much as CUDA experts occupy a different tier of the job market from "ordinary" software developers). Eventually things converged, and the speed advantage of hand assembly was decisively outweighed by the flexibility of writing in a high-level language like C or C++ and trusting the compiler to make the code run optimally on any given CPU.

Hardly anyone writes new code in assembly these days. I believe a similar shift will eventually happen in AI training and inference code, and for much the same reasons: computers are good at optimization, and flexibility and development speed matter more and more, especially if it also saves enormous amounts of hardware cost because you no longer have to keep paying the "CUDA tax" that underwrites Nvidia's 90%+ margins.

There is another way things could change dramatically, though: CUDA itself might end up surviving mainly as a high-level specification, a "lingua franca" akin to Verilog (the industry standard for describing chip layouts), which skilled developers use to describe highly parallel algorithms (because they already know it, it is well structured, and it is a common language), but which, instead of being compiled for Nvidia GPUs, is fed as source code into an LLM that transpiles it into whatever low-level code the new Cerebras chip, the new Amazon Trainium2, or the new Google TPUv6 understands. This is not as far off as you might think; it is probably already within reach using OpenAI's latest O3 model, and will surely be broadly practical within a year or two.
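A minimal sketch of what that "CUDA as source, LLM as transpiler" workflow might look like; everything here is hypothetical: `llm_translate()` stands in for a call to a strong code model (via the same hypothetical `ask_model()` wrapper used earlier), and the target names are purely illustrative.

```python
CUDA_KERNEL = """
__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
"""

TARGETS = ["triton", "cerebras_sdk", "trainium_kernel"]  # illustrative target names only

def llm_translate(source: str, target: str) -> str:
    # Hypothetical call to a code-generation model asked to port the kernel.
    prompt = (f"Port this CUDA kernel to {target}, preserving numerical behavior. "
              f"Return only code.\n\n{source}")
    return ask_model(prompt)  # ask_model() is the same hypothetical LLM wrapper as above

ports = {t: llm_translate(CUDA_KERNEL, t) for t in TARGETS}
# Each port would then be compiled and validated against the CUDA reference outputs.
```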

Theoretical threat

Perhaps the most shocking development came in just the past few weeks, and it completely rocked the AI world. Although the mainstream media barely mentioned it, it became the obsession of the technorati on Twitter: a Chinese startup called DeepSeek released two new models whose performance is essentially on par with the best models from OpenAI and Anthropic (surpassing Meta's Llama3 models and other smaller open source models such as Mistral). They are called DeepSeek-V3 (essentially their answer to GPT-4o and Claude 3.5 Sonnet) and DeepSeek-R1 (essentially their answer to OpenAI's O1 model).

Why is this so shocking? First, DeepSeek is a small company, reportedly with fewer than 200 employees. The story is that they started out as a quantitative trading hedge fund along the lines of TwoSigma or RenTec, but after China tightened regulation of that sector, they redirected their math and engineering talent toward AI research. What is certain is that they have published two extraordinarily detailed technical reports, one for DeepSeek-V3 and one for DeepSeek-R1.

These are heavy technical reports, and if you know no linear algebra they may be hard to follow. But the better approach is to download the free DeepSeek app from the App Store, install it and log in with your Google account, and give it a try (you can also install it on Android), or simply try it in a browser on your desktop. Make sure to select the "DeepThink" option to enable chain-of-thought (the R1 model), and ask it to explain parts of the technical reports in plain language.

This will also tell you some important things:

First of all, these models are absolutely legit. There is a lot of fakery in AI benchmarks, which are routinely gamed so that a model looks great on the benchmark but falls apart in real-world use. Google is easily the worst offender here, constantly crowing about how wonderful its LLMs are, when in practice they perform so poorly in real-world tests that they cannot reliably complete even the simplest tasks, let alone a challenging coding task. The DeepSeek models are different: their responses are coherent and compelling, and absolutely on the same level as the models from OpenAI and Anthropic.

Second, DeepSeek has made deep progress not just in model quality but, more importantly, in training and inference efficiency. By working extremely close to the hardware and stacking together a handful of distinctive, very clever optimizations, DeepSeek is able to train these incredible models on GPUs in a dramatically more efficient way. By some measurements, DeepSeek is roughly 45 times more efficient than other frontier labs.

DeepSeek claims the full cost of training DeepSeek-V3 was just over $5 million. By the standards of OpenAI, Anthropic and the rest, that is nothing: those companies crossed the threshold of $100 million-plus for a single model training run as early as 2024.

How is this possible? How could this small Chinese company completely outdo all the brilliant people at our leading AI labs, which have a hundred times the resources, headcount, payroll, capital, GPUs and everything else? Wasn't China supposed to be hobbled by Biden's restrictions on GPU exports? Well, the details are fairly technical, but we can at least describe them in broad strokes. It may simply turn out that DeepSeek's relative poverty in GPU horsepower was the key ingredient that forced its creativity and ingenuity, because necessity is the mother of invention.

One major innovation is their sophisticated mixed-precision training framework, which lets them use 8-bit floating point numbers (FP8) throughout the training process. Most Western AI labs train using "full-precision" 32-bit numbers (the bit width basically specifies how many gradations are possible in describing the output of an artificial neuron; the 8 bits of FP8 can store a far wider range of numbers than you might expect, not just 256 equally spaced values as with a regular integer, because clever mathematical tricks let it represent both very small and very large numbers, though naturally with less precision than 32 bits). The main tradeoff is that while FP32 can store numbers with amazing precision across an enormous range, FP8 sacrifices some of that precision to save memory and boost performance, while remaining precise enough for many AI workloads.

DeepSeek cracked this problem by developing a clever system that breaks numbers into small blocks for activations and tiles for weights, and strategically applies high-precision calculations at key points in the network. Unlike other labs, which train at high precision first and then compress afterwards (losing some quality in the process), DeepSeek's FP8-native approach means they get the massive memory savings without paying for it in performance. When you are training across thousands of GPUs, this dramatic reduction in per-GPU memory requirements translates into needing far fewer GPUs overall.
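The core trick of blockwise low-precision storage can be sketched in a few lines of NumPy: store each small block of a tensor in a low-precision format plus one higher-precision scale per block. This is only a cartoon of the idea (using int8 storage as a stand-in, since NumPy has no FP8 type), not DeepSeek's actual training code.

```python
import numpy as np

BLOCK = 128  # elements per block; blockwise schemes use small tiles like this

def quantize_blockwise(x: np.ndarray):
    """Store each block as int8 values plus one float32 scale (a cartoon of FP8 tiling)."""
    x = x.reshape(-1, BLOCK)
    scales = np.abs(x).max(axis=1, keepdims=True) / 127.0 + 1e-12
    q = np.round(x / scales).astype(np.int8)
    return q, scales.astype(np.float32)

def dequantize_blockwise(q, scales):
    return (q.astype(np.float32) * scales).reshape(-1)

w = np.random.randn(1024).astype(np.float32)
q, s = quantize_blockwise(w)
w_hat = dequantize_blockwise(q, s)
print("memory ratio ~", (q.nbytes + s.nbytes) / w.nbytes)   # roughly 4x smaller storage
print("max error:", np.abs(w - w_hat).max())                # small but nonzero
```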

Another big breakthrough is their multi-token prediction system. Most Transformer-based LLMs run inference by predicting the next token, one token at a time.

DeepSeek worked out how to predict multiple tokens at once while maintaining the quality you would get from single-token prediction. Their approach achieves roughly 85-90% accuracy on these extra predicted tokens, which effectively doubles inference speed without sacrificing much quality. The clever part is that they preserve the full causal chain of predictions, so the model is not just guessing blindly; the extra predictions are structured and context-aware.
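The general flavor of this can be illustrated with a draft-and-verify loop of the kind used in speculative decoding: propose several future tokens cheaply, then keep only the prefix the full model agrees with, so causality is never violated. This is a generic sketch of that family of techniques, not DeepSeek's exact multi-token prediction head; `draft_tokens()`, `model_agrees()` and `full_model_next_token()` are hypothetical helpers.

```python
def generate_with_draft_and_verify(prompt_tokens: list[int], max_new: int = 256) -> list[int]:
    """Accept a run of cheaply drafted tokens only while the full model agrees with them,
    otherwise fall back to the full model's own next token."""
    out = list(prompt_tokens)
    while len(out) - len(prompt_tokens) < max_new:
        drafted = draft_tokens(out, k=4)               # hypothetical: propose 4 tokens cheaply
        accepted = []
        for tok in drafted:
            if model_agrees(out + accepted, tok):      # hypothetical: full model verifies the token
                accepted.append(tok)
            else:
                break
        if not accepted:
            accepted = [full_model_next_token(out)]    # hypothetical: ordinary single-token step
        out.extend(accepted)
    return out
```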

One of their most innovative developments is what they call multi-head latent attention (MLA). This is their breakthrough in handling the so-called key-value indices, which are basically how individual tokens are represented inside the attention mechanism of the Transformer architecture. Without getting too deep into the technical weeds, suffice it to say that these KV indices are one of the main consumers of VRAM during training and inference, and part of the reason why thousands of GPUs have to be used simultaneously to train these models: each GPU tops out at 96GB of VRAM, and these indices eat that memory right up.

Their MLA system finds a way to store a compressed version of these indices that captures the essential information while using far less memory. The really neat part is that the compression is built directly into how the model learns; it is not a separate step they have to bolt on, but part of the end-to-end training pipeline. That means the whole mechanism is "differentiable" and can be trained directly with standard optimizers. It works because the underlying representations these models learn actually live at a much lower dimensionality than the so-called "ambient dimension", so storing the full KV indices is wasteful, even though that is what basically everyone else does.

The payoff is not just that you stop wasting huge amounts of space storing data beyond what is actually needed, which gives a big improvement in training memory footprint and efficiency (again, slashing the number of GPUs needed to train a world-class model); it can actually improve model quality, because the compression acts as a "regularizer" that forces the model to focus on what really matters instead of spending wasted capacity fitting noise in the training data. So not only do you save a ton of memory, the model may even perform better. At worst, you get the enormous memory savings without a serious hit to performance, which is normally the tradeoff you face in AI training.
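The flavor of the idea, minus all of DeepSeek's engineering, is a learned low-rank bottleneck on the keys and values: project each token's hidden state down to a small latent vector, cache only that, and re-expand it when attention is computed. The sketch below is a toy NumPy illustration of that bottleneck, not the actual MLA formulation.

```python
import numpy as np

d_model, d_latent, n_heads, d_head = 4096, 512, 32, 128

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02            # learned down-projection
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # learned up-projection for keys
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) * 0.02   # ... and for values

def cache_token(hidden_state):
    """Cache only the small latent vector instead of full per-head keys and values."""
    return hidden_state @ W_down          # shape (d_latent,): 8x smaller than d_model here

def expand_for_attention(latent):
    k = (latent @ W_up_k).reshape(n_heads, d_head)
    v = (latent @ W_up_v).reshape(n_heads, d_head)
    return k, v

h = rng.standard_normal(d_model)
latent = cache_token(h)
k, v = expand_for_attention(latent)
print(latent.shape, k.shape, v.shape)     # (512,) (32, 128) (32, 128)
```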

They have also made major strides in GPU communication efficiency through their DualPipe algorithm and custom communication kernels. The system intelligently overlaps computation with communication, carefully balancing GPU resources between the two tasks. They need only about 20 of each GPU's streaming multiprocessors (SMs) for communication, leaving the rest free for compute. The result is far higher GPU utilization than in a typical training setup.
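The basic idea of overlapping communication with computation (though not DualPipe itself, which is far more elaborate) can be shown with PyTorch's asynchronous collectives: kick off the gradient all-reduce, keep computing while it is in flight, and only wait on it when the result is needed. The sketch assumes the script is launched under torchrun with a process group already initialized.

```python
import torch
import torch.distributed as dist

def overlapped_step(grad_bucket: torch.Tensor, next_inputs: torch.Tensor, model):
    """Start the all-reduce for one gradient bucket, then overlap it with the
    forward pass for the next micro-batch instead of sitting idle."""
    handle = dist.all_reduce(grad_bucket, op=dist.ReduceOp.SUM, async_op=True)

    # Useful work proceeds while the collective runs in the background.
    with torch.no_grad():
        _ = model(next_inputs)

    handle.wait()                    # block only when the reduced gradients are needed
    grad_bucket /= dist.get_world_size()
    return grad_bucket
```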

Another very clever thing they did was to use what is known as a mixture-of-experts (MOE) Transformer architecture, with key innovations around load balancing. As you may know, the size or capacity of an AI model is usually measured by the number of parameters it contains. A parameter is just a number that stores some property of the model: the "weight", or importance, of a particular artificial neuron relative to another, or the importance of a particular token depending on its context (inside the "attention mechanism"), and so on.

Meta's latest Llama3 models come in several sizes: a 1-billion-parameter version (the smallest), a 70B-parameter model (the most commonly used), and even a massive 405B-parameter model. For most users that largest model has limited practical value, because you would need tens of thousands of dollars' worth of GPUs in your computer just to run inference at acceptable speed, at least if you deploy the original full-precision version. So most of the real-world usage and excitement around these open source models happens at the 8B-parameter or heavily quantized 70B-parameter level, since that is what fits on a consumer-grade Nvidia 4090 GPU, which you can now buy for under $1,000.

So why does any of this matter? In a sense, the parameter count and precision tell you how much raw information or data is stored inside the model. Note that I am not talking about reasoning ability, the model's "IQ" as it were: it turns out that models with surprisingly few parameters can show excellent cognitive ability on tasks like solving complex logic problems, proving theorems in plane geometry, SAT math problems, and so on.

But those small models will not necessarily be able to tell you about every plot twist in every novel by Stendhal, whereas the really big models potentially can. The "cost" of that encyclopedic knowledge is that the models become extremely unwieldy to train and to run, because to perform inference you always need to keep all 405B of those parameters (or however many there are) in GPU VRAM at the same time.

The beauty of the MOE approach is that you can decompose the big model into a collection of smaller models, each holding different, non-overlapping (or at least not fully overlapping) knowledge. DeepSeek's innovation here is a load-balancing strategy they describe as "auxiliary-loss-free", which keeps the experts evenly utilized without the performance degradation that load balancing usually introduces. Then, depending on the nature of each inference request, you intelligently route it to whichever "expert" model in the collection is best able to answer that question or perform that task.

You can loosely think of it as a committee of experts, each with their own specialty: one might be a lawyer, another a computer scientist, another a business strategist. If someone asks a question about linear algebra, you don't hand it to the lawyer. This is only a very rough analogy, of course; in practice it does not work quite like that.
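In code, the essence of MOE routing is a small gating network that picks the top-k experts per token, so only a sliver of the total parameters is touched for any given input. The sketch below is a generic top-2 router in NumPy, purely to illustrate the mechanism; it contains nothing of DeepSeek's auxiliary-loss-free balancing.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 256, 8, 2

W_gate = rng.standard_normal((d_model, n_experts)) * 0.02
experts = [  # each expert is a small feed-forward block; only top_k of them run per token
    (rng.standard_normal((d_model, 4 * d_model)) * 0.02,
     rng.standard_normal((4 * d_model, d_model)) * 0.02)
    for _ in range(n_experts)
]

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route one token through its top-2 experts and mix their outputs by gate weight."""
    logits = x @ W_gate
    chosen = np.argsort(logits)[-top_k:]                 # indices of the top-k experts
    gates = np.exp(logits[chosen]) / np.exp(logits[chosen]).sum()
    out = np.zeros_like(x)
    for g, idx in zip(gates, chosen):
        w1, w2 = experts[idx]
        out += g * (np.maximum(x @ w1, 0.0) @ w2)        # ReLU feed-forward expert
    return out

token = rng.standard_normal(d_model)
print(moe_layer(token).shape)   # (256,) -- only 2 of the 8 experts were evaluated
```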

The real advantage of this approach is that it lets the model contain an enormous amount of knowledge without being unwieldy, because even though the total parameter count across all the experts is high, only a small fraction of them are "active" at any given moment, which means you only need to hold that small subset of weights in VRAM to run inference. In DeepSeek-V3's case, they have an absolutely massive 671B-parameter MOE model, far bigger than even the largest Llama3 model, yet only 37B of those parameters are active at any one time; few enough to fit in the combined VRAM of two consumer-grade Nvidia 4090 GPUs (under $2,000 total), rather than requiring one or more H100 GPUs at roughly $40,000 apiece.

There are rumors that ChatGPT and Claude both use MoE architectures; GPT-4 is reported to have about 1.8 trillion parameters in total, spread across 8 experts of roughly 220 billion parameters each. Although that is far more manageable than fitting all 1.8 trillion parameters into VRAM at once, the sheer amount of memory involved still means it takes multiple H100-class GPUs just to run the model.

In addition to the above, the technical paper mentions several other key optimizations. These include an extremely memory-efficient training framework that avoids tensor parallelism, recomputes certain operations during backpropagation rather than storing them, and shares parameters between the main model and the auxiliary prediction module. The sum of all these innovations, layered together, gets you close to the roughly 45x efficiency-improvement figures circulating online, and I am entirely willing to believe those numbers are in the right ballpark.
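
As an illustration of just one of those ideas, recomputing activations during the backward pass instead of storing them, here is a minimal PyTorch sketch using the stock gradient-checkpointing utility. DeepSeek’s training framework is custom-built and far more elaborate; this only shows the generic technique.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(32, 1024, requires_grad=True)

# Standard forward pass: intermediate activations are kept for the backward pass.
y_standard = block(x)

# Checkpointed forward pass: activations inside `block` are discarded and
# recomputed during backward, trading extra FLOPs for lower peak memory.
y_checkpointed = checkpoint(block, x, use_reentrant=False)

y_checkpointed.sum().backward()
print(x.grad.shape)  # gradients match the standard path
```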

DeepSeek’s API pricing is strong evidence: although the model’s performance is close to best-in-class, the cost of making inference requests through its API is about 95% lower than comparable models from OpenAI and Anthropic. In a sense it is a bit like comparing Nvidia’s GPUs to competitors’ new custom chips: even if they are not quite as good, the cost-effectiveness is so much better that, as long as you can establish that the performance level meets your requirements and that API availability and latency are good enough, they win (so far, people have been pleasantly surprised by how well DeepSeek’s infrastructure has held up despite an incredible surge in demand driven by these new models).

But unlike the Nvidia case, where the cost difference comes from the 90%+ monopoly gross margins they earn on data-center products, the cost difference between DeepSeek’s API and the OpenAI and Anthropic APIs may simply reflect a nearly 50x improvement in computing efficiency (and perhaps far more than that on the inference side; the roughly 45x figure refers to training efficiency). In fact, it is not even clear that OpenAI and Anthropic are making big profits on their API services; they may be more interested in revenue growth and in the data they collect by analyzing all the API requests they receive.

Before continuing, I must point out that many people speculate DeepSeek lied about the number of GPUs and the training time for these models, and that they actually have more H100s than they admit: because those cards are subject to export restrictions, they neither want to get themselves into trouble nor hurt their chances of obtaining more of them. While that is certainly possible, I think it is more likely they are telling the truth, and that they achieved these incredible results simply through extreme ingenuity and creativity in their training and inference methods. They have explained their approach, and I suspect it is only a matter of time before their results are widely replicated and confirmed by researchers at other labs.

A truly thoughtful model

The newer R1 model and its technical report may be even more shocking: they beat Anthropic to chain-of-thought reasoning, and they are now basically the only ones besides OpenAI to have made that technique work at scale. But note that OpenAI only released its o1-preview model in mid-September 2024: barely four months ago! One thing you have to keep in mind is that OpenAI is very secretive about how these models actually operate at a low level, and will not disclose the actual model weights to anyone except partners such as Microsoft under strict confidentiality agreements. DeepSeek’s models are the complete opposite: they are fully open source with permissive licenses. They have published very detailed technical reports explaining how the models work, along with code that anyone can inspect and try to reproduce.

With R1, DeepSeek basically cracked one of the holy grails of AI: getting models to reason step by step without relying on large supervised datasets. Their DeepSeek-R1-Zero experiment shows this: using pure reinforcement learning with carefully designed reward functions, they managed to get the model to develop sophisticated reasoning capabilities completely autonomously. It was not just problem-solving: the model organically learned to generate long chains of thought, verify its own work, and allocate more compute time to harder problems.

The technical breakthrough here is their novel approach to reward modeling. Rather than using complex neural reward models, which can lead to “reward hacking” (where the model inflates its reward in spurious ways without actually improving real performance), they developed a clever rule-based system that combines accuracy rewards (verifying final answers) with format rewards (encouraging structured thinking). This simpler approach turned out to be more robust and scalable than the process-based reward models others have tried.
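
A minimal sketch of what such a rule-based reward might look like (the tag names, answer format, and weighting below are my own placeholders, not DeepSeek’s actual reward code):

```python
import re

def format_reward(completion: str) -> float:
    """Reward structured thinking: the response should contain a reasoning
    section and a clearly delimited final answer (placeholder tag scheme)."""
    has_think = bool(re.search(r"<think>.*?</think>", completion, re.DOTALL))
    has_answer = bool(re.search(r"<answer>.*?</answer>", completion, re.DOTALL))
    return 0.5 * has_think + 0.5 * has_answer

def accuracy_reward(completion: str, ground_truth: str) -> float:
    """Reward a verifiably correct final answer, e.g. for math problems."""
    m = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if m is None:
        return 0.0
    return 1.0 if m.group(1).strip() == ground_truth.strip() else 0.0

def total_reward(completion: str, ground_truth: str) -> float:
    # Deterministic rules leave no learned reward model to be "hacked".
    return accuracy_reward(completion, ground_truth) + format_reward(completion)

sample = "<think>7 * 6 = 42</think><answer>42</answer>"
print(total_reward(sample, "42"))  # 2.0
```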

What is particularly fascinating is that during training they observed so-called “aha moments”, in which the model spontaneously learned to revise its own reasoning midway when it hit uncertainty. This emergent behavior was not explicitly programmed; it arose naturally from the interaction between the model and the reinforcement-learning environment. The model would literally stop, flag potential problems in its reasoning, and then start over with a different approach, none of which it was explicitly trained to do.

The full R1 model builds on these insights by introducing what they call “cold start” data: a small set of high-quality examples used before applying the reinforcement-learning techniques. They also address a major problem with reasoning models: language consistency. Earlier attempts at chain-of-thought reasoning often produced models that mixed multiple languages or generated incoherent output. DeepSeek solves this with a subtle language-consistency reward during RL training, trading a small performance loss for far more readable and consistent output.
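
As a rough illustration of how a language-consistency signal could be folded into the reward (this heuristic is mine; the paper does not spell out its measure at this level of detail), one could reward the fraction of the chain of thought written in the target language’s script:

```python
def language_consistency_reward(chain_of_thought: str) -> float:
    """Toy heuristic: fraction of alphabetic characters that are ASCII,
    used as a proxy for 'stayed in English' when the prompt is English."""
    letters = [c for c in chain_of_thought if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.isascii() for c in letters) / len(letters)

print(language_consistency_reward("The derivative of x^2 is 2x."))  # 1.0
print(language_consistency_reward("The derivative 是 2x。"))         # < 1.0
```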

The results are incredible: R1 scores 79.8% accuracy on AIME 2024, one of the most challenging high-school math competitions, which is comparable to OpenAI’s o1 model. It reaches 97.3% on MATH-500 and a 96.3 percentile ranking on Codeforces programming contests. But perhaps most impressive is that they managed to distill these abilities into much smaller models: their 14B-parameter version outperforms many models several times its size, suggesting that reasoning ability is not just a function of raw parameter count but also of how you train the model to process information.
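
A minimal sketch of that distillation recipe as I understand it from the report: fine-tune a small open model on reasoning traces generated by R1 with an ordinary next-token objective (the model name and the toy data below are placeholders, not their actual setup).

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder student checkpoint; the released distilled models were built
# on Qwen and Llama bases.
student_name = "Qwen/Qwen2.5-1.5B"
tok = AutoTokenizer.from_pretrained(student_name)
student = AutoModelForCausalLM.from_pretrained(student_name)
student.train()
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

# Each example is a prompt plus a teacher-generated reasoning trace and answer.
traces = [
    "Problem: What is 12 * 13?\n<think>12 * 13 = 120 + 36 = 156</think>\nAnswer: 156",
]

for text in traces:
    batch = tok(text, return_tensors="pt", truncation=True, max_length=1024)
    # Standard causal-LM loss: the student learns to imitate the teacher's trace.
    loss = student(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```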

Aftermath

The rumor currently circulating on Twitter and Blind (an anonymous corporate gossip site) is that these models caught Meta completely off guard, and that they even outperform the new Llama4 model that is still in training. Apparently the Llama project inside Meta has attracted the attention of senior technical leadership, so they have roughly 13 people working on Llama, each of whom individually earns more per year than the total training cost of the DeepSeek-V3 model, a model that outperforms theirs. How do you explain that to Zuckerberg with a straight face? When a better model was trained with only 2,000 H100s and less than $5 million, how does he keep smiling after committing billions of dollars to Nvidia for 100,000 H100s?

But you’d better believe that Meta and every other big AI lab are taking these DeepSeek models apart, studying every word of the technical reports and every line of the open-source code they released, desperately trying to fold the same tricks and optimizations into their own training and inference pipelines. So what is the impact of all this? Well, the naive view is that the aggregate demand for training and inference compute should be divided by some large number. Maybe not 45, but perhaps 25 or even 30? Because whatever you thought you needed before, you now need a lot less.

Optimists might say: “You’re just talking about a simple proportional constant, a single multiple. When you’re dealing with an exponential growth curve, these things wash out quickly and end up not mattering much.” There is some truth to that: if AI really is as transformative as I expect, if the practical utility of this technology is measured in the trillions, if inference-time compute is the new scaling law, and if we are going to have fleets of humanoid robots running inference constantly, then maybe the growth curve is still so steep and extreme that Nvidia remains far enough ahead to succeed anyway.

But Nvidia needs a lot of good news over the coming years to sustain its valuation, and when you take all of these factors into account, I at least start to feel very uneasy about buying the stock at 20 times its expected 2025 sales. What if sales growth slows even slightly? What if growth comes in at 85% rather than above 100%? What if gross margin slips from 75% to 70%, which would still be remarkably high for a semiconductor company?

Summary

From a macro perspective, Nvidia faces an unprecedented convergence of competitive threats that make its 20x forward sales multiple and 75% gross margins increasingly difficult to justify. The company’s advantages in hardware, software, and efficiency are all showing worrying cracks. The whole world, thousands of the smartest people on the planet backed by untold billions of dollars of capital, is trying to attack it from every angle.

On the hardware side, the innovative architectures from Cerebras and Groq show that Nvidia’s interconnect advantage, the cornerstone of its data-center dominance, can be circumvented through radical redesign. Cerebras’s wafer-scale chips and Groq’s deterministic compute approach deliver compelling performance without needing Nvidia’s complex interconnect solutions. More conventionally, every major Nvidia customer (Google, Amazon, Microsoft, Meta, Apple) is developing custom silicon that could eat into its high-margin data-center revenue. These are no longer experimental projects: Amazon alone is building out massive infrastructure for Anthropic containing more than 400,000 custom chips.

The software moat looks equally fragile. New high-level frameworks such as MLX, Triton, and JAX are eroding CUDA’s importance, and efforts to improve AMD’s drivers could unlock much cheaper hardware alternatives. The trend toward higher-level abstraction mirrors how assembly language gave way to C/C++, suggesting CUDA’s dominance may be more short-lived than assumed. On top of that, we are seeing the rise of LLM-based code-translation technology that can automatically port CUDA code to run on any hardware target, potentially eliminating one of Nvidia’s most powerful lock-in effects.

Perhaps most disruptive of all is DeepSeek’s recent efficiency breakthrough, which delivers comparable model performance at roughly 1/45th of the compute cost. This suggests the entire industry has been massively over-provisioning compute. Combined with the emergence of more efficient inference architectures via chain-of-thought models, aggregate demand for compute may be far lower than current forecasts assume. The economics here are compelling: when DeepSeek can match GPT-4-level performance while charging 95% less for API calls, it implies either that Nvidia’s customers are burning money unnecessarily or that margins must come down dramatically.

TSMC will manufacture competitive chips for any well-funded customer, which caps the value of Nvidia’s architectural advantages. More fundamentally, history shows that markets eventually find ways around artificial bottlenecks that generate excess profits. Taken together, these threats suggest Nvidia faces a far rockier road to maintaining its current growth trajectory and margins than its valuation implies. With five distinct lines of attack (architectural innovation, customer vertical integration, software abstraction, efficiency breakthroughs, and manufacturing democratization), the probability that at least one of them succeeds in meaningfully denting Nvidia’s margins or growth rate looks high. At current valuations, the market is pricing in none of these risks.
