LLMを理解する必読論文12選

♥ 407↻ 64

1. Attention Is All You Need Transformerを導入した論文。self-attentionがどのようにリカレンスを置き換えるか、そしてなぜこの1つのアーキテクチャがすべての現代LLMを支えているのかを学ぶ。この論文を解読した記事を読めます： Amit Shekhar @amitiitbhu · Apr 7 記事: Transformerアーキテクチャを解読するこのブログでは、Transformerアーキテクチャをパーツごとに解読することで学びます。各コンポーネントが何をするか、どのように連携するか、そしてなぜこのアーキテクチャがすべての... Amit Shekhar @amitiitbhu · Apr 3 記事: Attentionの背後にある数学 — Q、K、V このブログでは、ステップバイステップの数値例を使って、Attention（Query、Key、Value）の背後にある数学を学びます。以下を扱います：Attentionの公式、セットアップから... Amit Shekhar @amitiitbhu · Apr 5 記事: AttentionにおけるスケーリングファクターŸ√dₖの背後にある数学このブログでは、ステップバイステップの数値例を使って、Transformerアーキテクチャでなぜdot product attentionをŸ√dₖでスケールするのかを学びます。 Amit Shekhar @amitiitbhu · Apr 13 記事: LLMにおけるFeed-Forwardネットワーク Feed-ForwardネットワークをLLMの中で学びます。それが何か、Transformerアーキテクチャの中でどう機能するか、なぜすべてのTransformerレイヤーにそれが必要か、どんな役割を... 2. BERT マスク言語モデリングと、双方向コンテキストがなぜBERTを理解・分類タスクのデフォルトにしたかを学ぶ。 3. GPT-3: Language Models are Few-Shot Learners 1750億パラメータのdecoder-onlyモデルがin-context learningを解放する仕組み、つまりプロンプト内のわずかな例から新しいタスクを習得する方法を学ぶ。 4. Scaling Laws for Neural Language Models コンピュート・データ・パラメータに伴い損失がどのように予測可能に低下するかを学ぶ。これによりGPUを1時間も使う前にモデルサイズとトレーニングを計画できる。 5. Chinchilla ほとんどの大規模モデルがアンダートレーニングだったことと、パラメータあたり約20トークンがコンピュート最適であることを学ぶ。固定のコンピュートバジェットでは、少ないトークンでトレーニングした大きいモデルより、多くのトークンでトレーニングした小さいモデルの方が勝る。 6. InstructGPT ChatGPTを支えるメソッドの背後にある論文。RLHFが教師あり微調整、報酬モデリング、PPOを通じて生のテキスト予測器をどのように有用なアシスタントに変えるかを学ぶ。 7. Chain-of-Thought Prompting モデルにステップバイステップで考えるよう求めることが、数学・論理・多段階問題における推論を劇的に改善する仕組みを学ぶ。 8. Retrieval-Augmented Generation RetrieverとGeneratorを組み合わせることで、再トレーニングなしに新鮮で事実に基づいた文書を使ってモデルが回答できるようにする方法を学ぶ。 9. LoRA: Low-Rank Adaptation LoRAが完全な重みの代わりに小さなrank-decomposition行列をトレーニングすることで、訓練可能なパラメータを10,000分の1に削減する方法を学ぶ。これはQLoRAが後に単一GPUで70Bモデルをファインチューニングするために使用した基盤だ。 10. LLaMA よくトレーニングされた13BモデルがほとんどのベンチマークでどのようにGPT-3を上回れるか、そしてオープンウェイトが研究の風景全体をどのように変えたかを学ぶ。 11. FlashAttention IO-awareなattentionが数学を変えることなくメモリ使用量を削減してトレーニングを高速化する方法を学ぶ。この論文を解読した記事を読めます： Amit Shekhar @amitiitbhu · Apr 11 記事: LLMにおけるFlash Attentionを解読する Flash Attentionをパーツごとに解読することで学びます。標準的なattentionがなぜ遅いか、Flash Attentionがなぜ速いか、GPUメモリをどのように巧みに使うか... 12. DPO: Direct Preference Optimization 報酬モデルなし・強化学習なしで、選好データに直接基づいてモデルをアラインする方法を学ぶ。

原文を表示 / Show original

1. Attention Is All You Need The paper that introduced the Transformer. We will learn how self-attention replaces recurrence, and why this one architecture powers every modern LLM. I decoded this paper, you can read these articles: Amit Shekhar @amitiitbhu · Apr 7 Article Decoding Transformer Architecture In this blog, we will learn about the Transformer architecture by decoding it piece by piece - understanding what each component does, how they work together, and why this architecture powers every... 5 115 615 73K Amit Shekhar @amitiitbhu · Apr 3 Article Math behind Attention - Q, K, and V In this blog, we will learn about the math behind Attention: Query(Q), Key(K), and Value(V) with a step-by-step numeric example. We will cover the following: The Attention Formula Setting Up: From... 9 224 1.3K 192K Amit Shekhar @amitiitbhu · Apr 5 Article Math behind √dₖ Scaling Factor in Attention In this blog, we will learn about why we scale the dot product attention by √dₖ in the Transformer architecture with a step-by-step numeric example. If you have already read our blog on Math behind... 3 56 347 71K Amit Shekhar @amitiitbhu · Apr 13 Article Feed-Forward Networks in LLMs In this blog, we will learn about Feed-Forward Networks in LLMs - understanding what they are, how they work inside the Transformer architecture, why every Transformer layer needs one, and what role... 1 17 85 18K 2. BERT We will learn masked language modeling and why bidirectional context made BERT the default for understanding and classification tasks. 3. GPT-3: Language Models are Few-Shot Learners We will learn how a 175B-parameter decoder-only model unlocks in-context learning, picking up new tasks from just a few examples in the prompt. 4. Scaling Laws for Neural Language Models We will learn how loss decreases predictably with compute, data, and parameters, so we can plan model size and training before burning a single GPU hour. 5. Chinchilla We will learn that most large models were undertrained, and that roughly 20 tokens per parameter is compute-optimal. At a fixed compute budget, a smaller model trained on more tokens beats a bigger one trained on less. 6. InstructGPT The paper behind the methods that power ChatGPT. We will learn how RLHF turns a raw text predictor into a helpful assistant through supervised fine-tuning, reward modeling, and PPO. 7. Chain-of-Thought Prompting We will learn how asking a model to think step by step dramatically improves reasoning on math, logic, and multi-step problems. 8. Retrieval-Augmented Generation We will learn how combining a retriever with a generator lets models answer using fresh, factual documents, without retraining. 9. LoRA: Low-Rank Adaptation We will learn how LoRA cuts trainable parameters by 10,000x by training tiny rank-decomposition matrices instead of the full weights. This is the foundation that QLoRA later used to fine-tune 70B models on a single GPU. 10. LLaMA We will learn how a well-trained 13B model can outperform GPT-3 on most benchmarks, and how open weights reshaped the entire research landscape. 11. FlashAttention We will learn how IO-aware attention cuts memory usage and speeds up training, without changing the math. I decoded this paper, you can read the article: Amit Shekhar @amitiitbhu · Apr 11 Article Decoding Flash Attention in LLMs In this blog, we will learn about Flash Attention by decoding it piece by piece - understanding why standard attention is slow, what makes Flash Attention fast, how it uses GPU memory cleverly, and... 21 109 19K 12. DPO: Direct Preference Optimization We will learn how to align a model on preference data directly, without a reward model and without reinforcement learning.

AIFCC — AI Fluent CxO Club

読み書きそろばん、AI。経営者が AI を自分で動かせるようになるコミュニティ。

他の記事を見る AIFCC について