Wednesday, August 7, 2024

NLP research methods

 https://kinoshita.eti.br/2017/06/03/natural-language-processing-and-natural-language-understanding.html

https://en.m.wikipedia.org/wiki/Natural_language_processing

Research methods in Natural Language Processing (NLP) have evolved from rule-based linguistics to data-driven statistical models and, most recently, to deep learning architectures. Because NLP sits at the intersection of linguistics, computer science, and statistics, its research methodologies are highly structured and iterative.

Here is an overview of the core research pipeline in modern NLP:


1. Problem Formulation and Data Collection

Research typically begins by identifying a specific task (e.g., Sentiment Analysis, Machine Translation, or Question Answering).

  • Corpus Acquisition: Gathering a large body of text. This can be from web scraping (Common Crawl), specialized datasets (Wikipedia, news archives), or proprietary domain-specific data.

  • Data Annotation: If the research involves supervised learning, human experts must label the data (e.g., tagging parts of speech or identifying "ground truth" answers).

2. Data Preprocessing (The Cleaning Phase)

Raw text is messy and must be standardized before it can be processed by a model; a minimal sketch of these steps follows the list below.

  • Tokenization: Breaking sentences into individual words or sub-words.

  • Normalization: Lowercasing, removing punctuation, or "Stemming/Lemmatization" (reducing words like "running" to "run").

  • Stop-word Removal: Filtering out common words like "the" or "is" that may not carry significant semantic weight for certain tasks.
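A minimal sketch of these three steps, assuming NLTK is installed and its `punkt` and `stopwords` resources have been downloaded:

```python
# Minimal preprocessing sketch (assumes: pip install nltk, plus
# nltk.download("punkt") and nltk.download("stopwords") have been run).
import string

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

text = "The cats are running on the mats."

# 1) Tokenization: split the sentence into word tokens.
tokens = word_tokenize(text)

# 2) Normalization: lowercase and drop punctuation tokens.
tokens = [t.lower() for t in tokens if t not in string.punctuation]

# 3) Stop-word removal: drop very common, low-information words.
stop_words = set(stopwords.words("english"))
tokens = [t for t in tokens if t not in stop_words]

print(tokens)  # e.g. ['cats', 'running', 'mats']
```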

3. Feature Engineering 

  • Feature Extraction: Creating new features from raw data (e.g., pulling "Day of the Week" from a raw "Timestamp" or using PCA to condense 100 variables into 5). In NLP, feature extraction usually means vectorization: turning text into numbers with methods such as Bag of Words, TF-IDF, or word embeddings like Word2Vec (a short representation sketch follows this list). In this stage, the researcher decides which characteristics of the text are most informative for the target task.
    • Lexical Features: Extracting specific keywords, N-grams (word sequences), or morphological roots.
    • Structural Features: Using Part-of-Speech (PoS) tagging or Dependency Parsing to extract the grammatical relationship between words.
    • Dimensionality Reduction: Techniques like LDA (Latent Dirichlet Allocation) might be used to extract "topics" from a large document set, reducing thousands of words to a few key themes.
  • Feature Representation: Once features are identified, they must be represented in a format a computer can optimize, i.e., choosing the mathematical format for those features (e.g., turning categories into binary numbers or text into dense vectors).
    • Sparse Representation: Using Bag of Words (BoW), One-Hot Encoding or TF-IDF vectors. These are high-dimensional and "sparse" because most values are zero.
    • Dense Representation (Embeddings): Mapping extracted features into a continuous vector space (e.g., Word2Vec, GloVe). This is crucial for capturing semantic similarity—ensuring that terms like "optimization" and "efficiency" are mathematically adjacent.
    • Contextual Representation: Modern research often uses Transformers to create dynamic representations where a word's vector changes based on the surrounding text (e.g., "bank" of a river vs. investment "bank").
  • Feature Selection: Picking the most important features and discarding the noise to prevent "overfitting."
  • Feature Transformation: Scaling or normalizing data (e.g., making sure a "Price" feature and an "Age" feature are on the same scale, like 0 to 1).
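As a rough illustration of sparse versus dense representations, a short sketch assuming scikit-learn and gensim are installed; the three-sentence corpus is made up for the example:

```python
# Sparse vs. dense representations on a toy corpus
# (assumes: pip install scikit-learn gensim).
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are pets",
]

# Sparse representation: TF-IDF vectors (mostly zeros, one column per term).
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(corpus)
print(X.shape)                      # (3 documents, vocabulary-size columns)
print(tfidf.get_feature_names_out())

# Dense representation: Word2Vec embeddings (every dimension carries signal).
sentences = [doc.split() for doc in corpus]
w2v = Word2Vec(sentences, vector_size=50, window=2, min_count=1, seed=42)
print(w2v.wv["cat"].shape)          # (50,) dense vector
print(w2v.wv.most_similar("cat"))   # nearest neighbours in embedding space
```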

4. Feature Modeling

This is where the distinction between "Traditional" and "Modern" NLP is most visible.

  • Statistical/Traditional Methods: Researchers manually define features using techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or N-grams. Models like Hidden Markov Models (HMM) or Support Vector Machines (SVM) are often used here (a small pipeline sketch follows this list).

  • Deep Learning Methods: Researchers design neural architectures (e.g., Transformers, LSTMs) that automatically learn features through layers. Current research focuses heavily on Large Language Models (LLMs) and Self-Supervised Learning, where the model learns from unlabeled text by predicting missing words.
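A minimal sketch of the traditional route (TF-IDF features feeding an SVM), assuming scikit-learn; the labelled sentences are toy data, not from any benchmark:

```python
# Traditional statistical pipeline: TF-IDF features + linear SVM
# (assumes: pip install scikit-learn; the labelled examples are toy data).
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

train_texts = ["great movie, loved it", "terrible plot, waste of time",
               "wonderful acting", "boring and way too long"]
train_labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

clf = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),  # unigram + bigram features
    ("svm", LinearSVC()),
])
clf.fit(train_texts, train_labels)

print(clf.predict(["loved the acting", "what a waste"]))  # e.g. [1 0]
```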

5. Training and Optimization

  • Objective Functions: Defining a loss function (like Cross-Entropy Loss) that measures how far the model's prediction is from the truth (see the training-loop sketch after this list).

  • Hyperparameter Tuning: Adjusting learning rates, batch sizes, and model depth to optimize performance.

  • Transfer Learning: Taking a pre-trained model (like BERT or GPT) and "fine-tuning" it on a specific, smaller dataset for a specialized task.
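A minimal PyTorch training-loop sketch showing how the objective function and the usual hyperparameters fit together; the model, the random data, and the chosen values are placeholders for illustration:

```python
# Minimal training-loop sketch (assumes: pip install torch; the model,
# random data, and hyperparameter values are placeholders for illustration).
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Hyperparameters a researcher would tune.
learning_rate = 1e-3
batch_size = 32
num_epochs = 3

# Toy data: 256 "documents" as 100-dim feature vectors, 2 classes.
X = torch.randn(256, 100)
y = torch.randint(0, 2, (256,))
loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)

model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()                    # objective function
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

for epoch in range(num_epochs):
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)  # distance between prediction and truth
        loss.backward()
        optimizer.step()
    print(f"epoch {epoch}: loss {loss.item():.3f}")
```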

6. Evaluation Metrics

To determine if a research method is successful, researchers use standardized quantitative metrics (a short sketch computing several of them follows this list):

  • Accuracy/F1-Score: Common for classification tasks.

  • BLEU/ROUGE: Used for translation and summarization to compare machine output against human references.

  • Perplexity: Measures how well a probability model predicts a sample.

  • Human Evaluation: For tasks like creative writing or reasoning, researchers often employ human "judges" to score the output on fluency and factual correctness.
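A short sketch computing several of these metrics, assuming scikit-learn and NLTK are installed; the labels, token lists, and loss value are toy numbers:

```python
# Metric sketch (assumes: pip install scikit-learn nltk; values are toy data).
import math
from sklearn.metrics import accuracy_score, f1_score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Classification: accuracy and F1 against gold labels.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy_score(y_true, y_pred))          # 0.8
print(f1_score(y_true, y_pred))                # 0.8

# Translation/summarization: BLEU compares machine output to a human reference.
reference = [["the", "cat", "sat", "on", "the", "mat"]]
candidate = ["the", "cat", "is", "on", "the", "mat"]
print(sentence_bleu(reference, candidate,
                    smoothing_function=SmoothingFunction().method1))

# Language modelling: perplexity is the exponential of the average
# per-token cross-entropy (in nats) on held-out text.
avg_cross_entropy = 3.2   # hypothetical per-token loss
print(math.exp(avg_cross_entropy))             # perplexity ≈ 24.5
```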

NLP model development steps (as outlined by ChatGPT):

To develop an NLP model using the terms provided, the process generally follows these steps:

1. **Data Collection**: Gather and prepare a dataset of text that will be used for training and testing the NLP model.

2. **Tokenization**:

   - **Explanation**: Split the text into smaller units called tokens, which can be words, subwords, or characters. 

   - **Purpose**: Tokenization allows the model to process and analyze text at a granular level.

   - **Example**: "The cat sat on the mat." becomes ["The", "cat", "sat", "on", "the", "mat"].

3. **Stop Words Removal**:

   - **Explanation**: Remove common words that have little meaning on their own, such as "the," "is," and "and."

   - **Purpose**: Reduce noise in the data, focusing the model on more meaningful words.

   - **Example**: After removal, ["The", "cat", "sat", "on", "the", "mat"] might become ["cat", "sat", "mat"].

4. **Stemming**:

   - **Explanation**: Reduce words to their root form by removing suffixes (e.g., "running" → "run").

   - **Purpose**: Simplify words to a common base form, reducing vocabulary size.

   - **Example**: "running", "runner", "ran" all stem to "run".

5. **Lemmatization**:

   - **Explanation**: Similar to stemming, but it reduces words to their base or dictionary form, known as the lemma, taking the surrounding context and part of speech into account.

   - **Purpose**: Ensure words are reduced to their meaningful base form, which may differ based on context.

   - **Example**: "better" → "good", "running" → "run".

6. **Feature Extraction**:

- **Explanation**: Convert tokens into numerical features that the model can understand.

- **Methods**:

  - **Bag of Words (BoW)**: Represents text by the frequency of words in the document, ignoring word order and context.

  - **TF-IDF (Term Frequency-Inverse Document Frequency)**: Adjusts the frequency of words by their importance across documents. (https://bdi.or.th/big-data-101/tf-idf-1/)

  - **Word Embedding**: An advanced method that transforms words into dense vectors that capture semantic relationships based on the words surrounding them (i.e., their context). Common methods include Word2Vec, GloVe, and FastText.

- **Purpose**: Transform text data into a format suitable for modeling, whether through simple frequency counts or more complex vector representations.
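A quick sketch of the Bag-of-Words representation with scikit-learn (TF-IDF works the same way via `TfidfVectorizer`); the two sentences are toy data:

```python
# Bag of Words sketch (assumes: pip install scikit-learn; toy sentences).
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the cat chased the dog"]

bow = CountVectorizer()
X = bow.fit_transform(docs)

print(bow.get_feature_names_out())
# ['cat' 'chased' 'dog' 'mat' 'on' 'sat' 'the']
print(X.toarray())
# [[1 0 0 1 1 1 2]     <- word counts for sentence 1
#  [1 1 1 0 0 0 2]]    <- word counts for sentence 2
```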

7. **Modeling with Deep Learning Algorithms**:

   - **Explanation**: Use deep learning techniques to build the NLP model.

   - **Purpose**: Leverage complex neural networks to capture patterns and relationships in text data.

   - **Common Models**:

     - **RNN (Recurrent Neural Network)**: Suitable for sequence-based tasks like text generation.

     - **LSTM (Long Short-Term Memory)**: An advanced form of RNN that handles long-term dependencies.

     - **Transformer**: State-of-the-art model architecture for NLP tasks (e.g., BERT, GPT).
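A minimal Keras sketch of an LSTM text classifier, assuming TensorFlow is installed; the vocabulary size, sequence length, and layer sizes are illustrative choices rather than recommendations:

```python
# Minimal LSTM text classifier (assumes: pip install tensorflow;
# vocab_size, max_len, and layer sizes are illustrative choices).
import tensorflow as tf

vocab_size = 10_000   # number of distinct tokens kept after preprocessing
max_len = 50          # tokens per (padded) input sequence
embedding_dim = 64

model = tf.keras.Sequential([
    tf.keras.Input(shape=(max_len,)),                # padded token-id sequences
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.LSTM(64),                        # sequence -> fixed-size vector
    tf.keras.layers.Dense(1, activation="sigmoid"),  # binary class probability
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```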

8. **Model Training**:

   - **Explanation**: Train the deep learning model using the processed text data.

   - **Purpose**: Optimize model parameters to minimize error and improve accuracy.
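Continuing the sketch above, training reduces to a single `fit` call once token-id sequences and labels are prepared; the random arrays below merely stand in for real preprocessed data:

```python
# Training sketch continuing the LSTM model above
# (random integer sequences and labels stand in for real preprocessed data).
import numpy as np

X_train = np.random.randint(0, vocab_size, size=(1000, max_len))  # token ids
y_train = np.random.randint(0, 2, size=(1000,))                   # 0/1 labels

history = model.fit(
    X_train, y_train,
    validation_split=0.2,   # hold out 20% to watch generalization
    epochs=5,
    batch_size=32,
)
```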

9. **Evaluation**:

    - **Explanation**: Assess the model's performance on a validation set.

    - **Purpose**: Ensure the model generalizes well to unseen data.

10. **Deployment**:

    - **Explanation**: Integrate the trained model into a production environment.

    - **Purpose**: Make the model available for practical use.

11. **Monitoring and Maintenance**:

    - **Explanation**: Continuously monitor the model's performance and update it as needed.

    - **Purpose**: Ensure the model remains accurate and relevant over time.

Example implementation of a chatbot using an LSTM model (Thai-language tutorials):

https://medium.com/@newnoi/%E0%B8%A1%E0%B8%B2%E0%B8%AA%E0%B8%A3%E0%B9%89%E0%B8%B2%E0%B8%87-chatbot-%E0%B9%81%E0%B8%9A%E0%B8%9A%E0%B9%84%E0%B8%97%E0%B8%A2%E0%B9%86-%E0%B8%94%E0%B9%89%E0%B8%A7%E0%B8%A2-machine-learning-lstm-model-%E0%B8%81%E0%B8%B1%E0%B8%99%E0%B8%94%E0%B8%B5%E0%B8%81%E0%B8%A7%E0%B9%88%E0%B8%B2-part1-6230eac8d1f8

https://medium.com/@newnoi/%E0%B8%AA%E0%B8%AD%E0%B8%99%E0%B8%84%E0%B8%AD%E0%B8%A1%E0%B8%9E%E0%B8%B9%E0%B8%94%E0%B9%81%E0%B8%9A%E0%B8%9A%E0%B9%84%E0%B8%97%E0%B8%A2%E0%B9%86-%E0%B8%94%E0%B9%89%E0%B8%A7%E0%B8%A2-machine-learning-model-part2-2a1609af1bd7