LW - We are headed into an extreme compute overhang by devrandom

The Nonlinear Library

Content provided by The Nonlinear Fund. All podcast content, including episodes, graphics, and podcast descriptions, is uploaded and provided directly by The Nonlinear Fund or its podcast platform partner. If you believe someone is using your copyrighted work without your permission, you can follow the process described at https://fi.player.fm/legal.

14d ago 4:29


Welcome to The Nonlinear Library, where we use Text-to-Speech software to convert the best writing from the Rationalist and EA communities into audio. This is: We are headed into an extreme compute overhang, published by devrandom on April 28, 2024 on LessWrong.

If we achieve AGI-level performance using an LLM-like approach, the training hardware will be capable of running on the order of 1,000,000 concurrent instances of the model.

Definitions

Although there is some debate about the definition of compute overhang, I believe that the AI Impacts definition matches the original use, and I prefer it: "enough computing hardware to run many powerful AI systems already exists by the time the software to run such systems is developed". A large compute overhang leads to additional risk due to faster takeoff.

I use the types of superintelligence defined in Bostrom's Superintelligence book (summary here).

I use the definition of AGI in this Metaculus question. The adversarial Turing test portion of the definition is not very relevant to this post.

Thesis

For practical reasons, the compute requirements for training LLMs are several orders of magnitude larger than what is required for running a single inference instance. In particular, a single NVIDIA H100 GPU can run inference at a throughput of about 2000 tokens/s, while Meta trained Llama3 70B on a GPU cluster[1] of about 24,000 GPUs. Assuming we require a performance of 40 tokens/s per instance, the training cluster can run (2000 / 40) × 24,000 = 1,200,000 concurrent instances of the resulting 70B model.

I will assume that the above ratios hold for an AGI-level model. Considering the amount of data children absorb via the vision pathway, the amount of training data for LLMs may not be that much higher than the data humans are trained on, and so the current ratios are a useful anchor. This is explored further in the appendix.

Given the above ratios, we will have the capacity for ~1e6 AGI instances at the moment that training is complete.
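The instance-count estimate in the thesis can be reproduced with a few lines of arithmetic. The following is a minimal sketch that uses only the figures quoted in the post (about 2000 tokens/s per H100, about 24,000 GPUs, 40 tokens/s per instance). The function and parameter names are illustrative, and the sketch assumes that a GPU's inference throughput can be split evenly among instances and that throughput scales linearly across the cluster, which a real deployment would only approximate.

```python
def concurrent_instances(tokens_per_s_per_gpu: float,
                         tokens_per_s_per_instance: float,
                         gpu_count: int) -> int:
    """Estimate how many model instances a cluster can serve at once.

    Assumption (not from the post): throughput divides evenly among
    instances and scales linearly with the number of GPUs.
    """
    # Instances a single GPU can sustain at the required per-instance speed.
    instances_per_gpu = tokens_per_s_per_gpu / tokens_per_s_per_instance
    # Scale up to the whole training cluster.
    return int(instances_per_gpu * gpu_count)


if __name__ == "__main__":
    # Figures quoted in the post: H100 throughput, required speed, cluster size.
    n = concurrent_instances(2000, 40, 24_000)
    print(f"{n:,} concurrent instances")
```

Changing the required per-instance speed shows how sensitive the estimate is: doubling it to 80 tokens/s halves the instance count.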
This will likely lead to superintelligence via a "collective superintelligence" approach. Additional speed may then be available via accelerators such as GroqChip, which produces 300 tokens/s for a single instance of a 70B model. This would result in a "speed superintelligence" or a combined "speed+collective superintelligence".

From AGI to ASI

With 1e6 AGIs, we may be able to construct an ASI, with the AGIs collaborating in a "collective superintelligence". Similar to groups of collaborating humans, a collective superintelligence divides tasks among its members for concurrent execution.

AGIs derived from the same model are likely to collaborate more effectively than humans because their weights are identical. Any fine-tune can be applied to all members, and text produced by one can be understood by all members.

Tasks that are inherently serial would benefit more from a speedup than from a division of tasks. An accelerator such as GroqChip will be able to accelerate serial thought speed by a factor of 10x or more.

Counterpoints

It may be the case that a collective of sub-AGI models can reach AGI capability. It would be advantageous if we could achieve AGI earlier, with sub-AGI components, at a higher hardware cost per instance. This would reduce the compute overhang at the critical point in time.

There may be a paradigm change on the path to AGI resulting in smaller training clusters, reducing the overhang at the critical point.

Conclusion

A single AGI may be able to replace one human worker, presenting minimal risk. A fleet of 1,000,000 AGIs may give rise to a collective superintelligence. This capability is likely to be available immediately upon training the AGI model.

We may be able to mitigate the overhang by achieving AGI with a cluster of sub-AGI components.
Appendix - Training Data Volume

A calculation of training data processed by humans during development:

time: ~20 years, or 6e8 seconds
raw data input: ~10 Mb/s = 1e7 b/s
total for human training data: 6e15 bits

Llama3 training s...
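The human-side estimate in the appendix can be checked with a short calculation. This is a rough sketch using the post's own inputs (about 20 years of development and about 1e7 bits/s of raw input). The names are illustrative, and the year length of 365.25 days is an assumption added here, so the result lands slightly above the post's rounded 6e15 bits.

```python
# Assumption (not from the post): a year is taken as 365.25 days.
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def human_training_bits(years: float, bits_per_second: float) -> float:
    """Total raw input, in bits, received over the given number of years."""
    return years * SECONDS_PER_YEAR * bits_per_second


if __name__ == "__main__":
    # Inputs quoted in the post: ~20 years at ~1e7 bits/s.
    total = human_training_bits(20, 1e7)
    print(f"{total:.1e} bits")
```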
