Course Hive
100 Days of Deep Learning - Masked Self Attention | Masked Multi-head Attention in Transformer | Transformer Decoder

This course includes

  • 52 hours of video
  • Certificate of completion
  • Access on mobile and TV

Full Transcript

Masked multi-head attention is used in the Transformer decoder to ensure that each token in a sequence attends only to itself and to earlier tokens, never to future tokens. This masking is essential for autoregressive tasks such as language generation, because it lets the model generate a sequence one step at a time without peeking ahead. With multiple attention heads, the model captures diverse contextual information while still respecting this left-to-right constraint.

Notes: https://learnwith.campusx.in/s/store/courses/YouTube%20Notes

Did you like my teaching style? Check out my affordable mentorship program at: https://learnwith.campusx.in
DSMP FAQ: https://docs.google.com/document/d/1OsMe9jGHoZS67FH8TdIzcUaDWuu5RAbCbBKk2cNq6Dk/edit#heading=h.gvv0r2jo3vjw

📱 Grow with us:
CampusX's LinkedIn: https://www.linkedin.com/company/campusx-official
Slide into our DMs: https://www.instagram.com/campusx.official
My LinkedIn: https://www.linkedin.com/in/nitish-singh-03412789
Discord: https://discord.gg/PsWu8R87Z8
E-mail us at [email protected]

⌚ Time Stamps ⌚
00:00 - Intro
00:46 - Recap
05:25 - Autoregressive Models
17:57 - Transformer as an Autoregressive Model - Proof of concept
35:25 - The problem in Parallelizing
49:47 - Answering the Problem
01:00:15 - Outro
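
The masking idea in the lesson description can be illustrated with a small sketch. This is not code from the lesson: it is a minimal single-head example in plain Python, assuming scaled dot-product attention in which scores for future positions are set to negative infinity before the softmax. The function names `softmax` and `masked_self_attention` are hypothetical, and the learned query, key, and value projections and the split into multiple heads are left out for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax; entries equal to -inf get weight 0."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def masked_self_attention(q, k, v):
    """Single-head scaled dot-product attention with a causal mask.

    q, k, v are lists of T vectors (lists of floats); q and k share dimension d.
    Returns (outputs, weights), where weights[i][j] is how strongly
    position i attends to position j.
    """
    T = len(q)
    d = len(q[0])
    weights = []
    for i in range(T):
        scores = []
        for j in range(T):
            if j > i:
                # Future position: mask it out so it receives zero weight.
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(q[i], k[j]))
                scores.append(dot / math.sqrt(d))
        weights.append(softmax(scores))

    # Each output is the attention-weighted sum of the value vectors.
    outputs = [
        [sum(weights[i][j] * v[j][c] for j in range(T)) for c in range(len(v[0]))]
        for i in range(T)
    ]
    return outputs, weights


# Toy input: three token vectors used directly as queries, keys, and values.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, w = masked_self_attention(x, x, x)
for row in w:
    print([round(p, 3) for p in row])
```

Every row of the printed weight matrix is zero above the diagonal, so the first token attends only to itself and each later token attends only to itself and its predecessors. A multi-head version would, under the same assumptions, apply this mask independently inside each head and concatenate the per-head outputs.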
