Learn more at: en.wikipedia.org/wiki/BERT_(language_model)
Content derived and adapted from Wikipedia articles, licensed under CC BY-SA.
BERT (Bidirectional Encoder Representations from Transformers): A transformer-based model pre-trained on a large corpus of text data to understand linguistic context effectively using a deeply bidirectional neural network.
Transformer: A neural network architecture that uses self-attention mechanisms to weigh the significance of different words in a sequence to enhance language processing tasks.
Tokenizer: A component that converts text into a sequence of integers or tokens, enabling models like BERT to process language data.
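As a hedged example, the snippet below assumes the Hugging Face transformers library is installed and can download the public bert-base-uncased checkpoint; it converts a sentence into token IDs and maps them back to tokens:

```python
# Sketch: text -> integer token IDs with a BERT tokenizer
# (assumes the `transformers` package and the "bert-base-uncased" checkpoint are available).
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

text = "BERT converts text into integer token IDs."
ids = tokenizer.encode(text)                   # adds [CLS] and [SEP] automatically
tokens = tokenizer.convert_ids_to_tokens(ids)

print(ids)     # a list of integers, starting with 101 ([CLS]) and ending with 102 ([SEP])
print(tokens)  # ['[CLS]', 'bert', ..., '[SEP]']
```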
Embedding: A representation of tokens in a lower-dimensional continuous vector space, facilitating operations on text data within a model.
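A minimal NumPy sketch of an embedding lookup; the vocabulary size, dimensions, and random values are toy stand-ins for BERT's trained embedding matrix:

```python
# Toy embedding lookup: each token ID indexes a row of a dense matrix.
import numpy as np

vocab_size, hidden_dim = 10, 4                       # illustrative sizes, not BERT's
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, hidden_dim))

token_ids = np.array([2, 7, 7, 1])
embedded = embedding_table[token_ids]                # shape (4, 4): one vector per token
print(embedded.shape)
```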
Encoder: Part of the transformer architecture that processes input data by leveraging self-attention to understand context.
Token Type Embeddings: Dense vectors produced by a standard embedding layer that maps each one-hot encoded token to a learned representation based on its token type.
Position Embeddings: Learned vectors that encode the absolute position of each token within a sequence; unlike the original Transformer's fixed sinusoidal encodings, BERT's position embeddings are trained along with the rest of the model.
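As an illustration of a learned position table, here is a minimal NumPy sketch with toy sizes and random values standing in for trained weights (real BERT supports up to 512 positions with a hidden size of 768 or 1024):

```python
# Learned absolute position embeddings: one trainable row per position index,
# added elementwise to the token embeddings (toy sizes, random initialisation).
import numpy as np

max_positions, hidden_dim = 8, 4
rng = np.random.default_rng(1)
position_table = rng.normal(size=(max_positions, hidden_dim))  # trained parameters in real BERT

token_embeddings = rng.normal(size=(3, hidden_dim))            # 3 tokens in the sequence
positions = np.arange(3)
combined = token_embeddings + position_table[positions]
print(combined.shape)  # (3, 4)
```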
Segment Type Embeddings: Dense vectors that mark which of the two input segments (sentence A or sentence B) each token belongs to, encoded with a binary segment ID.
LayerNorm: A normalization technique that rescales each token's activation vector to zero mean and unit variance, then applies a learned scale and shift, keeping training numerically stable across layers.
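A minimal NumPy sketch of layer normalization as applied to each token vector; the learned scale (gamma) and shift (beta) parameters are shown at their usual initial values of 1 and 0:

```python
# Layer normalization over the feature (last) dimension of each token vector.
import numpy as np

def layer_norm(x, eps=1e-12):
    # gamma (scale) and beta (shift) are learned in a real model; initialised to 1 and 0 here.
    gamma, beta = np.ones(x.shape[-1]), np.zeros(x.shape[-1])
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.random.default_rng(2).normal(size=(2, 4))  # 2 toy token vectors of width 4
print(layer_norm(x))
```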
Masked Language Model (MLM): A training task where BERT predicts obscured words within a sentence to understand linguistic context bidirectionally.
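As a hedged illustration of the masked-language-model objective at inference time, the snippet below assumes the Hugging Face transformers library is installed and can download the public bert-base-uncased checkpoint:

```python
# Sketch: using a pretrained masked-language-model head to fill in a [MASK] token.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for candidate in fill_mask("The capital of France is [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```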
Next Sentence Prediction (NSP): A training task where BERT evaluates if one sentence logically follows another to capture inter-sentence relationships.
WordPiece: A sub-word tokenization technique used to handle out-of-vocabulary tokens by breaking words into smaller pieces.
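To see WordPiece splitting, the short snippet below (same transformers assumption as above) tokenizes a word that is not in the vocabulary as a whole; continuation pieces are prefixed with ##:

```python
# WordPiece splits out-of-vocabulary words into known sub-word pieces.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("embeddings"))  # typically ['em', '##bed', '##ding', '##s']
```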
[MASK] Token: A special token used during training to replace certain words, challenging the model to predict the missing words using context.
Attention Mechanism: Part of a transformer model that allows it to concentrate on specific parts of input data, crucial for understanding context.
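A self-contained NumPy sketch of single-head scaled dot-product self-attention, omitting the learned query/key/value projection matrices that a real transformer layer applies:

```python
# Scaled dot-product self-attention over a toy sequence (single head, no projections).
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(q, k, v):
    d_k = q.shape[-1]
    weights = softmax(q @ k.T / np.sqrt(d_k))  # attention weights, each row sums to 1
    return weights @ v, weights

x = np.random.default_rng(3).normal(size=(5, 4))  # 5 tokens, 4-dimensional each
output, weights = self_attention(x, x, x)         # queries = keys = values here
print(weights.shape)  # (5, 5): every token attends to every token
```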
Feed-forward Network: A position-wise neural network layer in each transformer block that further transforms the output of the attention sub-layer.
RoBERTa: A BERT variant that improves performance by tuning hyperparameters, dropping the next sentence prediction objective, training on more data, and using larger mini-batches.
DistilBERT: A compressed version of BERT, trained via knowledge distillation, that retains most of BERT's performance with a smaller, faster model.
TinyBERT: An efficient version of BERT with a significantly reduced parameter count aimed at maintaining strong performance metrics.
ALBERT: A BERT variant that shares parameters across layers, factorizes the embedding matrix, and replaces next sentence prediction with a sentence-order prediction task.
ELECTRA: A BERT-inspired model trained with replaced token detection: a small generator corrupts some input tokens and the main model, acting as a discriminator, learns to spot the replacements, a setup reminiscent of adversarial training.
DeBERTa: A BERT variant employing disentangled attention to separate positional and token encodings in its attention mechanism.
TPU (Tensor Processing Unit): A hardware accelerator designed by Google to speed up machine learning workloads, particularly neural network training.
BERT-Base: The standard version of BERT, with 12 layers and a hidden size of 768, for general use in language tasks.
BERT-Large: An enlarged version of BERT, with 24 layers and a hidden size of 1024, offering stronger performance.
BERT-Tiny: A miniaturized version of BERT with only 2 layers and a reduced hidden size, designed to maximize efficiency.
Global Pooling: A term from computer vision analogous to BERT's "pooler layer," referring to consolidating a variable-length sequence of token representations into a single fixed-size output vector.
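The hedged snippet below (Hugging Face transformers and PyTorch assumed to be installed) contrasts the per-token hidden states with the single pooled vector produced by BERT's pooler layer:

```python
# Sketch: per-token hidden states vs. the pooled single-vector output.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Pooling summarizes a whole sequence.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

cls_hidden = outputs.last_hidden_state[:, 0]  # raw [CLS] vector, shape (1, 768)
pooled = outputs.pooler_output                # [CLS] passed through a dense layer + tanh
print(cls_hidden.shape, pooled.shape)
```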
Attention Weights: Numerical values representing the significance of different words in a sequence, used in the attention mechanism to comprehend context.
Fine-tuning: The process of further training a pre-trained model like BERT on a specific task, typically by adding a task-specific output layer and updating the weights on a smaller labeled dataset.
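A compressed, hedged sketch of one fine-tuning step for binary sentence classification; the two example sentences and their labels are invented for illustration, and the transformers library plus PyTorch are assumed to be installed:

```python
# One illustrative fine-tuning step: a classification head on top of pretrained BERT.
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

batch = tokenizer(["great movie", "terrible plot"], padding=True, return_tensors="pt")
labels = torch.tensor([1, 0])            # invented sentiment labels, purely illustrative

outputs = model(**batch, labels=labels)  # the model returns a cross-entropy loss when labels are given
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```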
GLUE (General Language Understanding Evaluation): A benchmark for assessing the performance of natural language processing models across multiple tasks.
SQuAD (Stanford Question Answering Dataset): A reading comprehension dataset for evaluating models' question-answering capabilities.
SWAG (Situations With Adversarial Generations): A multiple-choice benchmark for grounded commonsense inference, where models pick the most plausible continuation of a described situation.