TCN vs RNN vs LSTM vs GRU: https://arxiv.org/pdf/1803.01271.pdf
Vs Transformer (used in LLMs): (https://www.sabrepc.com/blog/deep-learning-and-ai/rnns-vs-lstm-vs-transformers), (https://en.m.wikipedia.org/wiki/Transformer_(deep_learning_architecture))
====ChatGPT:====
What concept does a Transformer use instead of the loop in an LSTM?
Transformers replace the recurrent loops used in LSTM (Long Short-Term Memory) networks with a mechanism called **self-attention** (or attention mechanism) that allows them to process input sequences in parallel, rather than sequentially.
In LSTMs, the input is processed one timestep at a time, which introduces dependencies between time steps. This sequential nature can make LSTMs slower for long sequences because they require processing each token in the sequence one after the other.
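To make that sequential dependency concrete, here is a minimal PyTorch sketch (toy dimensions assumed here, not taken from the linked sources): each step's hidden state feeds the next, so the timesteps cannot be computed in parallel.

```python
import torch
import torch.nn as nn

# Toy dimensions (illustrative values only)
batch, seq_len, input_dim, hidden_dim = 2, 10, 8, 16

cell = nn.LSTMCell(input_dim, hidden_dim)
x = torch.randn(seq_len, batch, input_dim)   # (time, batch, features)

h = torch.zeros(batch, hidden_dim)           # hidden state
c = torch.zeros(batch, hidden_dim)           # cell state

# The loop is the point: step t cannot start before step t-1 finishes,
# because h and c carry information forward through time.
for t in range(seq_len):
    h, c = cell(x[t], (h, c))

print(h.shape)   # torch.Size([2, 16]) -- final hidden state per sequence
```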
In contrast, the Transformer model (https://youtu.be/ZXiruGOCn9s?si=1WbEjOVKBUs4Z-_J) uses the following key components to avoid loops:
1. **Self-Attention Mechanism**: Each token in the input sequence attends to all other tokens in the sequence. This allows the model to capture dependencies between tokens regardless of their distance in the sequence, in a single step (see the sketch after this list).
2. **Positional Encoding**: Since the Transformer, unlike an LSTM, has no built-in notion of sequence order, it adds positional encodings to the input embeddings. These encodings let the model know the position of each token in the sequence.
3. **Parallelization**: By using self-attention and positional encoding, Transformers process the entire sequence simultaneously, making them much faster and more efficient on modern hardware (especially for long sequences), as they can leverage parallelism more effectively than LSTMs.
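For contrast with the LSTM loop above, here is a minimal sketch of single-head scaled dot-product self-attention with sinusoidal positional encoding, in plain PyTorch with toy dimensions (assumed here, not taken from any of the sources); the whole sequence is handled in a few matrix multiplications, with no loop over time.

```python
import math
import torch

def sinusoidal_positional_encoding(seq_len, d_model):
    """Classic sin/cos positional encoding (even d_model assumed)."""
    pos = torch.arange(seq_len).unsqueeze(1).float()              # (seq_len, 1)
    i = torch.arange(0, d_model, 2).float()                       # even dimensions
    angle = pos / torch.pow(10000.0, i / d_model)                 # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a whole sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v                           # (seq_len, d_model)
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))      # every token vs every token
    weights = torch.softmax(scores, dim=-1)                       # attention distribution
    return weights @ v                                            # weighted mix of values

# Toy sequence: 6 tokens, model width 16 (illustrative numbers only)
seq_len, d_model = 6, 16
tokens = torch.randn(seq_len, d_model)                            # stand-in for token embeddings
x = tokens + sinusoidal_positional_encoding(seq_len, d_model)     # inject order information

w_q = torch.randn(d_model, d_model)
w_k = torch.randn(d_model, d_model)
w_v = torch.randn(d_model, d_model)

out = self_attention(x, w_q, w_k, w_v)
print(out.shape)   # torch.Size([6, 16]) -- all positions computed in parallel
```

A real Transformer stacks many such attention layers with multiple heads, learned projections, and feed-forward sublayers, but the parallel, loop-free structure is the same.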
What is a token in the above?
In the context of natural language processing (NLP) and Transformers, a **token** is a unit of text that the model processes. Tokens can represent:
1. **Words**: In word-level tokenization, each word in a sentence is treated as a token. For example, in the sentence "I love programming," the tokens would be "I," "love," and "programming."
2. **Subwords**: In subword-level tokenization, words are split into smaller components, especially for rare or complex words. For example, the word "unbelievable" might be tokenized as "un," "believ," and "able."
3. **Characters**: In character-level tokenization, each individual character is treated as a token. For instance, the word "hello" would be tokenized into 'h,' 'e,' 'l,' 'l,' 'o.'
Tokens are the basic units the model works with. Before processing text, the model converts raw input text into tokens, which are then transformed into numerical representations (embeddings) for computation.
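A toy illustration in plain Python (the vocabulary and the subword split are invented here for illustration; real models use trained tokenizers such as BPE or WordPiece) of the three granularities and the token-to-id step:

```python
sentence = "I love programming"

# 1. Word-level: split on whitespace.
word_tokens = sentence.split()                 # ['I', 'love', 'programming']

# 2. Character-level: every character is a token.
char_tokens = list("hello")                    # ['h', 'e', 'l', 'l', 'o']

# 3. Subword-level: rare words are broken into learned pieces.
#    Hand-picked split, mimicking what a trained subword tokenizer might produce.
subword_tokens = ["un", "believ", "able"]      # for "unbelievable"

# Before computation, tokens are mapped to integer ids via a vocabulary,
# and those ids index into an embedding table.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(word_tokens)))}
token_ids = [vocab[tok] for tok in word_tokens]
print(token_ids)                               # e.g. [0, 1, 2]
```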
Does the Transformer use a sliding-window-like concept?
No, Transformers do not inherently use a **sliding window** concept like traditional sequence models or convolutional models. Instead, they rely on the **self-attention mechanism** to capture relationships between tokens across the entire input sequence.
However, in certain contexts where input sequences are very long, it can be computationally expensive to apply self-attention over the entire sequence. To address this, some **Transformer variants** introduce mechanisms that resemble the sliding window approach. Examples include:
1. **Sliding Attention (Local Attention)**: Some models limit the attention mechanism to a local window around each token, rather than attending to the entire sequence. This approach reduces the computational cost by focusing on nearby tokens, similar to how a sliding window would (see the masking sketch below).
2. **Segmented Attention**: In models like **Longformer** and **BigBird**, self-attention is applied over smaller segments of the sequence (local windows) while maintaining some global attention to preserve connections across distant parts of the sequence.
3. **Reformer**: This model uses **locality-sensitive hashing** (LSH) to reduce the number of tokens that need to attend to each other, which mimics focusing attention within a smaller "window."
While the basic Transformer model does not use a sliding window, these adaptations help handle long sequences efficiently by introducing window-like behavior in the attention mechanism.
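As a rough sketch of the windowed idea (plain PyTorch, assumed window size, not the actual Longformer or BigBird code): attention scores outside a band of +/- `window` positions are set to -inf before the softmax, so each token only attends to its neighbours.

```python
import math
import torch

def local_attention(q, k, v, window=2):
    """Self-attention restricted to a +/- `window` band around each position."""
    seq_len = q.size(0)
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))   # (seq_len, seq_len)

    # Band mask: True where |i - j| <= window, i.e. inside the local window.
    idx = torch.arange(seq_len)
    band = (idx.unsqueeze(0) - idx.unsqueeze(1)).abs() <= window

    # Outside the window, set scores to -inf so softmax gives them zero weight.
    scores = scores.masked_fill(~band, float("-inf"))
    weights = torch.softmax(scores, dim=-1)
    return weights @ v

# Toy data: 8 tokens, width 16 (illustrative numbers only)
seq_len, d_model = 8, 16
q = torch.randn(seq_len, d_model)
k = torch.randn(seq_len, d_model)
v = torch.randn(seq_len, d_model)

out = local_attention(q, k, v, window=2)
print(out.shape)   # torch.Size([8, 16]); each row mixes at most 5 neighbouring tokens
```

Longformer additionally designates a few positions as global so they can attend to, and be attended by, every token; the sketch above shows only the local band.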