🧠 Embeddings & Tokens — เอมเบดดิ้งและโทเคน

🧩 โทเคนคืออะไร? — What is a Token?

🇹🇭 ภาษาไทย

ก่อนที่ AI จะเข้าใจข้อความของคุณ มันจะต้องหั่นข้อความออกเป็นชิ้นเล็กๆ ก่อน ชิ้นเล็กๆ เหล่านี้เรียกว่า โทเคน (token)

โทเคนหนึ่งตัวอาจเป็นคำทั้งคำ เช่น cat หรือเป็นส่วนของคำ เช่น un + believ + able สำหรับคำที่ยาวหรือไม่ค่อยใช้

ทุกโทเคนมี หมายเลข ID ของตัวเองในพจนานุกรมของโมเดล (ปกติมี 50,000–200,000 โทเคน)

🇬🇧 English

Before AI can understand your text, it first chops the text into small pieces. These pieces are called tokens.

A token can be a whole word like cat, or a piece of a word like un + believ + able for longer or rare words.

Every token has its own ID number in the model's vocabulary (usually 50,000–200,000 tokens).

🧱 Analogy — การเปรียบเทียบ

Think of tokens like LEGO bricks. You can't hand a whole sentence to a computer the way you hand it to a person — so the model breaks language into standard-sized bricks that it can build with.

โทเคนก็เหมือน LEGO bricks (ตัวต่อเลโก้) — คุณส่งประโยคทั้งประโยคให้คอมพิวเตอร์แบบเดียวกับคนไม่ได้ ดังนั้นโมเดลจึงหั่นภาษาออกเป็น "ก้อนอิฐ" ขนาดมาตรฐานที่เอาไปต่อได้

💡 Notice — สังเกต

EN: Common words like cat become one token. But a longer, rarer word like unbelievable might be split into pieces. This way, the model can handle any word — even ones it has never seen — by building it from known parts.

TH: คำที่ใช้บ่อยอย่าง cat จะเป็นหนึ่งโทเคน แต่คำที่ยาวหรือไม่ค่อยใช้อย่าง unbelievable อาจถูกแบ่งออก วิธีนี้ทำให้โมเดลสามารถจัดการคำใดก็ได้ — แม้แต่คำที่ไม่เคยเห็นมาก่อน — โดยการประกอบจากส่วนที่รู้จัก
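The splitting described above can be sketched in a few lines of Python. This is a simplified greedy longest-match splitter, not the algorithm of any particular model (real tokenizers typically use learned schemes such as byte-pair encoding). The vocabulary `VOCAB` and the function name `tokenize_word` are assumptions made up for this illustration.

```python
# Toy vocabulary: an assumption for illustration, not a real model's vocabulary.
VOCAB = {"cat", "un", "believ", "able", "the", "sat"}

def tokenize_word(word, vocab):
    """Split one word into the longest known pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            # No known piece starts here: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

print(tokenize_word("cat", VOCAB))           # ['cat']
print(tokenize_word("unbelievable", VOCAB))  # ['un', 'believ', 'able']
```

The single-character fallback is what lets this sketch handle a word it has never seen: in the worst case the word is spelled out letter by letter.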

🔄 From text to numbers — จากข้อความสู่ตัวเลข

Raw text: "The cat sat" → Tokens: [The] [cat] [sat] → Token IDs: [464, 2415, 3332]
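As a rough sketch, the last step of this flow is just a dictionary lookup. The mapping below reuses the three IDs from the example and is otherwise hypothetical. Splitting on spaces stands in for a real tokenizer, and the names `TOKEN_TO_ID`, `encode`, and `decode` are invented for this illustration.

```python
# Hypothetical vocabulary using the three IDs from the example above.
TOKEN_TO_ID = {"The": 464, "cat": 2415, "sat": 3332}
ID_TO_TOKEN = {token_id: token for token, token_id in TOKEN_TO_ID.items()}

def encode(text):
    """Turn text into token IDs. Splitting on spaces is a simplification."""
    return [TOKEN_TO_ID[token] for token in text.split()]

def decode(ids):
    """Turn token IDs back into text."""
    return " ".join(ID_TO_TOKEN[token_id] for token_id in ids)

ids = encode("The cat sat")
print(ids)          # [464, 2415, 3332]
print(decode(ids))  # The cat sat
```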

📍 เอมเบดดิ้งคืออะไร? — What are Embeddings?

🇹🇭 ภาษาไทย

หมายเลข ID อย่าง 2415 เป็นแค่ "เลขที่ในตู้เก็บของ" — มันไม่มีความหมายในตัวเอง 2415 ไม่ได้คล้ายกับ 2416 มากกว่า 98421 เลย

โมเดลจึงต้องการวิธีแทนความหมาย — นั่นคือที่มาของ เอมเบดดิ้ง (embedding) ทุกโทเคนจะได้รับรายการตัวเลขยาวๆ เรียกว่า เวกเตอร์ (vector) โดยทั่วไปมี 768, 1536 หรือ 4096 ตัวเลข

ตัวเลขเหล่านี้ไม่ได้ถูกคนกำหนด แต่โมเดลเรียนรู้เองจากการอ่านข้อความมหาศาล โดยจะดันคำที่ปรากฏในบริบทคล้ายกันให้อยู่ใกล้กัน

🇬🇧 English

A token ID like 2415 is just a locker number — it has no meaning on its own. 2415 is no more similar to 2416 than it is to 98421.

So the model needs a way to represent meaning. That's where embeddings come in. Every token gets a long list of numbers called a vector — usually 768, 1536, or 4096 numbers.

These numbers aren't set by humans. The model learns them by reading huge amounts of text, pushing words that appear in similar contexts close together.
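One way to picture this is a table with one row per token ID, where looking up an embedding means reading a row. The sketch below assumes a tiny three-token vocabulary with four made-up numbers per row. The names `EMBEDDING_TABLE` and `embed` are hypothetical, and a trained model would have learned its values rather than had them typed in.

```python
# Tiny embedding table: one row per token ID. The values are made up;
# a trained model learns them, and each row holds hundreds of numbers.
EMBEDDING_TABLE = [
    [0.21, -0.44, 0.87, 0.12],   # ID 0: "cat"
    [0.19, -0.41, 0.85, 0.15],   # ID 1: "dog"
    [-0.53, 0.71, -0.12, 0.88],  # ID 2: "pizza"
]

def embed(token_ids):
    """Replace each token ID with its row from the table."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]

vectors = embed([0, 2])
print(vectors[0])                     # [0.21, -0.44, 0.87, 0.12]
print(len(vectors), len(vectors[0]))  # 2 4
```

The ID itself carries no meaning here; it only selects which row to read.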

🔢 What an embedding actually looks like — เอมเบดดิ้งหน้าตาเป็นยังไง

"cat" →[ 0.21, -0.44, 0.87, 0.12, -0.05, 0.33, -0.71, 0.58, ... 1536 numbers ]

"dog" →[ 0.19, -0.41, 0.85, 0.15, -0.08, 0.29, -0.68, 0.61, ... 1536 numbers ]

"pizza" →[-0.53, 0.71, -0.12, 0.88, 0.42, -0.19, 0.33, -0.55, ... 1536 numbers ]

🗺️ Analogy — การเปรียบเทียบ

EN: Think of each vector as coordinates on a giant map of meaning. Instead of 2 coordinates like latitude/longitude, the map has hundreds of dimensions. Words with similar meanings end up close together — cat and dog are neighbors; pizza is far away.

TH: คิดซะว่าเวกเตอร์แต่ละตัวคือพิกัดบนแผนที่ความหมายขนาดยักษ์ แทนที่จะมีแค่ 2 พิกัด (ละติจูด-ลองจิจูด) แผนที่นี้มีหลายร้อยมิติ คำที่ความหมายคล้ายกันจะลงเอยอยู่ใกล้กัน — cat กับ dog เป็นเพื่อนบ้านกัน ส่วน pizza อยู่ไกลโพ้น
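To make the "close together" idea concrete, the sketch below measures straight-line distance between the example vectors shown earlier. It uses only the first eight numbers displayed for each word, so the results are illustrative rather than what a full 1536-number vector would give. The helper name `distance` is an assumption.

```python
import math

# First eight numbers of each example vector shown earlier in this lesson.
CAT = [0.21, -0.44, 0.87, 0.12, -0.05, 0.33, -0.71, 0.58]
DOG = [0.19, -0.41, 0.85, 0.15, -0.08, 0.29, -0.68, 0.61]
PIZZA = [-0.53, 0.71, -0.12, 0.88, 0.42, -0.19, 0.33, -0.55]

def distance(a, b):
    """Straight-line (Euclidean) distance between two points."""
    return math.dist(a, b)

print(round(distance(CAT, DOG), 2))    # 0.08
print(round(distance(CAT, PIZZA), 2))  # 2.51
```

On these numbers, cat sits about thirty times closer to dog than to pizza.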

🗺️ แผนที่ความหมาย — The Meaning Map

🇹🇭 ภาษาไทย

เอมเบดดิ้งจริงๆ มีหลายร้อยมิติ ซึ่งวาดไม่ได้ แต่เราสามารถบีบลงมาเป็น 2 มิติเพื่อให้เห็นภาพได้ ข้างล่างคือแผนที่ความหมายแบบย่อ — คำที่เกี่ยวข้องกันจะรวมกลุ่มกัน

🇬🇧 English

Real embeddings live in hundreds of dimensions — impossible to draw. But we can squish them down to 2D so you can see. Below is a simplified meaning map. Related words form clusters.

[Figure: 2D meaning map. Clusters shown: Animals (สัตว์), Food (อาหาร), Royalty & People (ราชวงศ์และคน), Vehicles (ยานพาหนะ), Emotions (อารมณ์)]

💡 What to notice — สังเกตอะไร

EN: Words in the same category cluster together. cat, dog, rabbit are neighbors. happy, joyful, excited form their own group. The model wasn't told any of this — the patterns emerged from reading billions of sentences.

TH: คำในหมวดเดียวกันรวมกลุ่มกัน cat, dog, rabbit เป็นเพื่อนบ้าน happy, joyful, excited มีกลุ่มของตัวเอง โมเดลไม่ได้ถูกบอกเรื่องพวกนี้เลย — รูปแบบพวกนี้ผุดขึ้นมาเองจากการอ่านประโยคเป็นพันล้านประโยค

📏 วัดความคล้าย — Measuring Similarity

🇹🇭 ภาษาไทย

เนื่องจากเอมเบดดิ้งเป็นแค่ตัวเลข เราสามารถคำนวณว่าสองคำคล้ายกันแค่ไหนได้ด้วยคณิตศาสตร์ เครื่องมือมาตรฐานคือ โคไซน์ซิมิลาริตี้ (cosine similarity) ซึ่งวัดมุมระหว่างเวกเตอร์สองตัว

ค่าโคไซน์ซิมิลาริตี้อยู่ระหว่าง −1 ถึง 1 ใกล้ 1.0 = ความหมายคล้ายมาก ใกล้ 0 = ไม่เกี่ยวข้องกัน

🇬🇧 English

Since embeddings are just numbers, we can calculate how similar two words are with math. The standard tool is cosine similarity — it measures the angle between two vectors.

Cosine similarity ranges from −1 to 1. Close to 1.0 = very similar meaning. Close to 0 = unrelated.
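A minimal sketch of the calculation: cosine similarity is the dot product of the two vectors divided by the product of their lengths. The short two-number vectors below are made up so the arithmetic is easy to follow by hand, and the function name `cosine_similarity` is chosen for this illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

print(cosine_similarity([3, 4], [6, 8]))    # 1.0   (same direction)
print(cosine_similarity([3, 4], [4, 3]))    # 0.96  (small angle)
print(cosine_similarity([1, 0], [0, 1]))    # 0.0   (right angle)
print(cosine_similarity([3, 4], [-3, -4]))  # -1.0  (opposite direction)
```

Note that [3, 4] and [6, 8] score 1.0 even though one is twice as long: only the direction matters, not the length.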

🌟 Real-world uses — การใช้งานจริง

Semantic search — ค้นหาตามความหมาย: Search engines such as Google look beyond exact letter matches and also return results that are close in meaning.
Recommendation systems — ระบบแนะนำ: Services such as Netflix can use embeddings of the movies you like to recommend movies with similar vectors.
RAG (Retrieval-Augmented Generation) — ระบบค้นข้อมูลแล้วสร้างคำตอบ: A chatbot such as ChatGPT connected to company documents uses embeddings to find the relevant documents before answering.
Chatbots like Claude & ChatGPT — แชทบอท: They use embeddings to "understand" your question.
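Semantic search and the retrieval step of RAG share one core move: embed the query, then rank stored items by how similar their vectors are. The sketch below assumes hand-written three-number vectors in place of the output of a real embedding model. The document titles and the names `DOCUMENTS` and `search` are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical embeddings; a real system would get these from an embedding model.
DOCUMENTS = {
    "How to care for a kitten": [0.9, 0.1, 0.0],
    "Best pizza dough recipe": [0.0, 0.9, 0.2],
    "Electric car buying guide": [0.1, 0.0, 0.9],
}

def search(query_vector, documents, top_k=1):
    """Return the titles of the top_k documents most similar to the query."""
    scored = [(cosine_similarity(query_vector, vec), title)
              for title, vec in documents.items()]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:top_k]]

# Pretend this is the embedding of the query "my cat won't eat".
query = [0.8, 0.2, 0.1]
print(search(query, DOCUMENTS))  # ['How to care for a kitten']
```

The query shares no words with the winning title; the match comes entirely from the vectors pointing in a similar direction.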

✨ คณิตศาสตร์บนความหมาย — Math on Meaning

🇹🇭 ภาษาไทย

เพราะความหมายอยู่ในพื้นที่พิกัด เราสามารถทำคณิตศาสตร์กับคำได้ ตัวอย่างคลาสสิกคือ:

🇬🇧 English

Because meaning lives in a coordinate space, you can do math on words. The classic example:

king − man + woman ≈ queen

🇹🇭 คำอธิบาย

ลบความ "เป็นชาย" ออกจาก "king" แล้วเพิ่มความ "เป็นหญิง" เข้าไป — เราจะมาหยุดใกล้ๆ "queen"

โมเดลเรียนรู้เรื่องนี้ได้โดยไม่มีใครบอกว่าเพศหรือราชวงศ์คืออะไร มันแค่สังเกตเห็นรูปแบบจากการที่คำเหล่านี้ถูกใช้

🇬🇧 Explanation

Subtract the "man-ness" from "king", add the "woman-ness", and you land near "queen".

The model learned this without being told what gender or royalty are. It just noticed the pattern in how these words get used.

💡 Why this matters — ทำไมถึงสำคัญ

EN: Embeddings don't just memorize words; they capture relationships. Roughly the same direction that turns king → queen also turns uncle → aunt, actor → actress, and prince → princess. Meaning has structure, and embeddings discover it.

TH: เอมเบดดิ้งไม่ได้แค่ท่องจำคำ แต่มันจับความสัมพันธ์ได้ด้วย ทิศทางที่เปลี่ยน king → queen ก็เป็นทิศทางโดยประมาณเดียวกับที่เปลี่ยน uncle → aunt, actor → actress และ prince → princess ความหมายมีโครงสร้าง และเอมเบดดิ้งค้นพบมัน
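The arithmetic can be tried on a toy example. The sketch below assumes hand-made two-number vectors whose coordinates stand for "royalty" and "gender". Real embedding dimensions are learned and carry no such labels, and real vectors only land near the answer rather than exactly on it. The names `WORDS` and `analogy` are hypothetical.

```python
import math

# Hand-made vectors: [royalty, gender]. Purely illustrative; real embedding
# dimensions are learned and do not carry labels like these.
WORDS = {
    "king":     [0.9,  0.9],
    "queen":    [0.9, -0.9],
    "man":      [0.1,  0.9],
    "woman":    [0.1, -0.9],
    "prince":   [0.7,  0.9],
    "princess": [0.7, -0.9],
}

def analogy(a, b, c, words):
    """Return the word nearest to a - b + c, ignoring the three input words."""
    target = [x - y + z for x, y, z in zip(words[a], words[b], words[c])]
    candidates = (w for w in words if w not in (a, b, c))
    return min(candidates, key=lambda w: math.dist(words[w], target))

print(analogy("king", "man", "woman", WORDS))    # queen
print(analogy("prince", "man", "woman", WORDS))  # princess
```

Leaving the three input words out of the candidates is a common convention, since otherwise the nearest word is often one of the inputs.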

🎯 ภาพรวมทั้งหมด — The Full Picture

นี่คือเส้นทางเต็มจากข้อความที่คุณพิมพ์ไปจนถึงตัวเลขที่โมเดลใช้จริง — Here's the full journey from your typed text to the numbers a model actually works with:

Text: "Hello world" → Tokens: [Hello] [world] → Token IDs: [15496, 995] → Embeddings: [0.2, -0.4, ...] → AI Model: does its magic

🎓 Key Takeaways — สรุปประเด็นสำคัญ

Tokens (โทเคน) are the pieces language gets broken into — ชิ้นส่วนที่ภาษาถูกหั่นออกมา
Token IDs are just lookup numbers — เป็นแค่เลขในตู้เก็บของ ไม่มีความหมายในตัวเอง
Embeddings (เอมเบดดิ้ง) turn each token into a vector that encodes meaning — เปลี่ยนแต่ละโทเคนเป็นเวกเตอร์ที่เก็บความหมาย
Similar meanings = similar vectors — ความหมายใกล้ = เวกเตอร์ใกล้ นี่คือสิ่งที่ทำให้ AI "เข้าใจ" ภาษา
You can do math on meaning — ทำคณิตศาสตร์กับความหมายได้ เลขคณิตของเวกเตอร์เผยความสัมพันธ์ที่เรียนรู้มา
Many modern AI systems, from chatbots like ChatGPT and Claude to Google search and Netflix recommendations, build on this idea — ระบบ AI สมัยใหม่จำนวนมากสร้างบนแนวคิดนี้

📖 คำศัพท์สำคัญ — Glossary

token

/ˈtoʊ.kən/

โทเคน (ชิ้นส่วนของข้อความ)

ชิ้นเล็กๆ ที่ AI ใช้ประมวลผล อาจเป็นคำเดียวหรือส่วนของคำ — A small piece of text the AI processes. Can be a whole word or a part of a word.

tokenization

/ˌtoʊ.kə.naɪˈzeɪ.ʃən/

การแบ่งข้อความเป็นโทเคน

ขั้นตอนการหั่นข้อความออกเป็นโทเคน — The process of breaking text into tokens before feeding it to an AI model.

vocabulary

/vəˈkæb.jə.ler.i/

พจนานุกรมของโมเดล

ชุดโทเคนทั้งหมดที่โมเดลรู้จัก ปกติ 50,000–200,000 ตัว — The complete set of tokens a model knows, usually 50,000–200,000.

embedding

/ɪmˈbed.ɪŋ/

เอมเบดดิ้ง (เวกเตอร์แทนความหมาย)

รายการตัวเลขยาวๆ ที่แทนความหมายของโทเคน — A long list of numbers that represents the meaning of a token in a way computers can use.

vector

/ˈvek.tər/

เวกเตอร์ (รายการตัวเลข)

รายการตัวเลข ในเอมเบดดิ้งมักยาว 768–4096 ตัว — A list of numbers. In embeddings, usually 768–4096 numbers long.

dimension

/daɪˈmen.ʃən/

มิติ

ตัวเลขแต่ละตัวในเวกเตอร์คือ "มิติ" หนึ่ง โมเดลใหญ่ๆ ใช้เวกเตอร์หลายพันมิติ — Each number in a vector is one dimension. Large models use vectors with thousands of dimensions.

similarity

/ˌsɪm.əˈler.ə.ti/

ความคล้ายคลึง

ความใกล้เคียงของความหมายระหว่างสองคำ วัดจากความใกล้กันของเวกเตอร์ เช่น มุมระหว่างเวกเตอร์ — How close two words are in meaning, measured by how close their vectors are, for example by the angle between them.

cosine similarity

/ˈkoʊ.saɪn ˌsɪm.əˈler.ə.ti/

โคไซน์ซิมิลาริตี้

วิธีวัดความคล้ายโดยดูมุมระหว่างเวกเตอร์ 1.0 = ทิศทางเดียวกัน (คล้ายที่สุด), 0 = ไม่เกี่ยวกัน — Measures similarity by the angle between vectors. 1.0 = same direction (most similar), 0 = unrelated.

cluster

/ˈklʌs.tər/

กลุ่ม / คลัสเตอร์

กลุ่มของคำที่มีความหมายใกล้กันและอยู่ใกล้กันในพื้นที่เอมเบดดิ้ง — A group of words with similar meanings that end up near each other in embedding space.

context

/ˈkɑn.tekst/

บริบท

คำรอบข้างที่ช่วยบอกความหมาย โมเดลเรียนรู้เอมเบดดิ้งโดยดูว่าคำถูกใช้ในบริบทอะไร — The surrounding words that help give meaning. Models learn embeddings by seeing which contexts each word appears in.

subword

/ˈsʌb.wɝːd/

ส่วนของคำ

ชิ้นของคำที่เล็กกว่าคำเต็ม เช่น "un" + "believ" + "able" ใช้สำหรับคำยาวหรือไม่ค่อยใช้ — A piece of a word smaller than a whole word. Used for long or rare words.

semantic search

/sɪˈmæn.tɪk sɜːrtʃ/

ค้นหาตามความหมาย

การค้นที่ใช้เอมเบดดิ้งเพื่อหาผลลัพธ์ที่ "ความหมายใกล้" ไม่ใช่แค่ตัวอักษรตรงกัน — Search that uses embeddings to find results with similar meaning, not just matching letters.

vector arithmetic

/ˈvek.tər əˈrɪθ.mə.tɪk/

เลขคณิตของเวกเตอร์

การบวก/ลบเวกเตอร์ของคำ เช่น king − man + woman ≈ queen — Adding and subtracting word vectors, e.g., king − man + woman ≈ queen.

neural network

/ˈnʊr.əl ˈnet.wɝːk/

เครือข่ายประสาทเทียม

ระบบคณิตศาสตร์ที่ได้แรงบันดาลใจจากสมอง ใช้สำหรับเรียนรู้เอมเบดดิ้งจากข้อมูล — A mathematical system loosely inspired by the brain, used to learn embeddings from data.

training

/ˈtreɪ.nɪŋ/

การฝึกสอนโมเดล

ขั้นตอนที่โมเดลเรียนรู้เอมเบดดิ้งจากข้อความมหาศาลหลายพันล้านประโยค — The process where a model learns embeddings by reading billions of sentences.

pattern

/ˈpæt.ərn/

รูปแบบ

รูปแบบที่เกิดขึ้นซ้ำๆ ในข้อมูล เอมเบดดิ้งเกิดจากการที่โมเดลสังเกตเห็นรูปแบบเหล่านี้ — Regularities that repeat in data. Embeddings arise from the model noticing these patterns.