Thursday, August 29, 2024

There Is No Destiny

To believe in destiny, that our every action results from the ripening of past karma, is to believe that karma dictates everything (karmic determinism).

But the Buddha said that karma is not brought about by oneself, not brought about by another, not brought about by both oneself and another, and does not arise by itself without cause; rather, it unfolds according to dependent origination (paṭiccasamuppāda).

Data virtualization

An approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted at source, or where it is physically located, and can provide a single customer view (or single view of any other entity) of the overall data.

Unlike the traditional extract, transform, load ("ETL") process, the data remains in place, and real-time access to the source system is provided. This reduces the risk of data errors and avoids the workload of moving around data that may never be used, and it does not attempt to impose a single data model on the data (an example of heterogeneous data is a federated database system). The technology also supports writing transaction data updates back to the source systems.

In summary, it differs from a data warehouse in that data need not be replicated into the DW (i.e., only E&T are performed, not L), and it can also write back to the original DBs, similar in concept to an Oracle VIEW.
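
The idea can be sketched in a few lines of Python. This is only a toy illustration, not a real virtualization layer (products such as Denodo or federated SQL engines do this at scale); the database files and table names are hypothetical, and the two SQLite sources are assumed to already exist:

```python
# Minimal sketch of the data-virtualization idea: a "virtual view" that
# queries the source systems live instead of replicating rows into a DW.
# The connection targets and table names below are hypothetical.
import sqlite3

import pandas as pd

def customer_360(customer_id: int) -> pd.DataFrame:
    """Single customer view assembled on demand from two live sources."""
    crm = sqlite3.connect("crm.db")          # source system 1 (stays in place)
    billing = sqlite3.connect("billing.db")  # source system 2 (stays in place)
    profile = pd.read_sql_query(
        "SELECT id, name, email FROM customers WHERE id = ?",
        crm, params=(customer_id,))
    invoices = pd.read_sql_query(
        "SELECT customer_id, amount, due_date FROM invoices WHERE customer_id = ?",
        billing, params=(customer_id,))
    # Join at query time; nothing is loaded into a central warehouse.
    return profile.merge(invoices, left_on="id", right_on="customer_id")
```

The point is that `customer_360` assembles the single customer view at query time; no rows are ever copied into a central store.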

Wednesday, August 28, 2024

AI tools

https://storm.genie.stanford.edu/ generates a Wikipedia-like report on your topic

STORM is a research prototype for automating the knowledge curation process



Monday, August 19, 2024

How does ChatGPT work?

ChatGPT consists of a transformer (an architecture that maps an input text sequence to an output text sequence, e.g., in translation), an LLM (trained to predict the next word given the previous words), and related components. It relies on supervised and reinforcement learning techniques.
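
The next-word prediction step can be illustrated with a toy model. This is not ChatGPT's actual architecture (a transformer computes scores from the whole preceding context, not just the last word); the scores below are random stand-ins for what a trained network would output:

```python
# Toy illustration of the core LLM idea: predict the next word given the
# previous words. This is a hand-rolled bigram model, not a transformer;
# real models learn these scores from huge corpora.
import numpy as np

vocab = ["the", "cat", "sat", "on", "mat"]
# Hypothetical learned scores (logits): rows = previous word, cols = next word.
logits = np.random.default_rng(0).normal(size=(len(vocab), len(vocab)))

def next_word(prev: str) -> str:
    scores = logits[vocab.index(prev)]
    probs = np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary
    return vocab[int(np.argmax(probs))]             # greedy decoding

text = ["the"]
for _ in range(4):                                  # autoregressive generation
    text.append(next_word(text[-1]))
print(" ".join(text))
```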

https://novaapp.ai/blog/technology-behind-chatgpt

https://medium.com/@ashish.sharma1981/chatgpt-architecture-exploring-the-inner-workings-of-the-language-model-41731fc05483

https://www.scalablepath.com/machine-learning/chatgpt-architecture-explained

https://youtu.be/lm_ZBWaK56k?si=I8P_btvKiCFwFL5G


Thursday, August 15, 2024

Akaike Information Criterion (AIC)

The Akaike Information Criterion (AIC) is a widely used measure of the quality of a statistical model. It quantifies 1) the goodness of fit and 2) the simplicity/parsimony of the model in a single statistic. The lower the AIC, the better the model.

Cf. https://coolstatsblog.com/2013/08/14/using-aic-to-test-arima-models-2/

AIC = 2k - 2 ln(L)

  • k is the number of parameters in the model.
  • ln(L) is the natural logarithm of the model's maximized likelihood, i.e., the likelihood evaluated at the best-fitting parameter values.
The term 2k is a penalty for model complexity. 
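
As a hedged sketch of how this is used in practice (following the ARIMA model comparison in the blog post cited above), statsmodels exposes both the log-likelihood and the AIC of a fitted model; the series below is a random stand-in for real data:

```python
# Comparing candidate ARIMA models by AIC, as in the cited blog post.
# `y` here is a synthetic stand-in; use your own series in practice.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

y = np.random.default_rng(1).normal(size=200).cumsum()

for order in [(1, 0, 0), (2, 0, 0), (1, 0, 1)]:
    fit = ARIMA(y, order=order).fit()
    print(order, round(fit.aic, 1))   # lower AIC = better trade-off

# The same number by hand: AIC = 2k - 2 ln(L)
fit = ARIMA(y, order=(1, 0, 0)).fit()
k = len(fit.params)                   # number of estimated parameters
print(2 * k - 2 * fit.llf)            # fit.llf is ln(L); should match fit.aic
                                      # up to statsmodels' parameter counting
```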

AIC is not generally used for Multi-Layer Perceptrons (MLPs).

The reasons are similar to why it isn't used for LLMs:

Complexity: MLPs are a type of neural network with a large number of weights and biases, even for a relatively small network. These parameters are not easily interpretable, and the 2k penalty term in the AIC formula would become so large that it would make the score meaningless for comparison.

Different Optimization Philosophy: MLPs are optimized through backpropagation to minimize a loss function (like Mean Squared Error or cross-entropy) on a training dataset. They are not typically fit using a maximum likelihood approach that can be easily translated into a likelihood score (L).

Alternative Metrics: The performance of MLPs and other neural networks is evaluated using metrics that are more appropriate for their task, such as accuracy, precision, recall, F1-score, or Mean Squared Error on a separate validation set.

Japanese proverb: "Continuity is power"

「継続は力なり」 (Keizoku wa chikara nari) means that determination and doing things continuously can build strength and success, an idea widely embraced in Japanese culture.

Monday, August 12, 2024

Genetic algorithm and Evolutionary computing

Genetic Algorithms (GAs) are a core component of evolutionary computing, which is a broader field inspired by the principles of natural evolution. Here's how GAs fit into evolutionary computing:

Evolutionary Computing:

  1. Definition:

    • Evolutionary computing is a class of optimization algorithms inspired by the principles of biological evolution. It includes a variety of techniques that mimic natural selection, genetic processes, and evolution to solve complex optimization problems.
  2. Core Concepts:

    • Natural Selection: The process where organisms better adapted to their environment tend to survive and produce more offspring.
    • Genetic Operators: Techniques such as crossover (recombination) and mutation that simulate biological evolution.

Genetic Algorithms (GAs):

  1. Definition:

    • Genetic Algorithms are a specific type of evolutionary algorithm used to find approximate solutions to optimization and search problems. They mimic the process of natural evolution to evolve solutions over generations.
  2. Components of GAs:

    • Population: A set of potential solutions to the problem, each represented as a chromosome or individual.
    • Selection: A process to choose individuals based on their fitness scores, favoring better solutions for reproduction.
    • Crossover (Recombination): Combines parts of two parent solutions to create new offspring, simulating biological reproduction.
    • Mutation: Introduces random changes to offspring to maintain genetic diversity and explore new areas of the solution space.
    • Fitness Function: Evaluates how well a solution solves the problem, guiding the selection process.
  3. Relation to Evolutionary Computing:

    • Foundational Role: GAs are one of the earliest and most well-known examples of evolutionary algorithms, demonstrating the core principles of evolutionary computing.
    • Broad Category: Evolutionary computing includes other algorithms as well, such as Evolution Strategies (ES), Evolutionary Programming (EP), and Genetic Programming (GP), each with variations on the evolutionary concepts.

In summary, Genetic Algorithms are a specific instantiation of evolutionary computing principles, and they play a significant role in demonstrating how evolutionary concepts can be applied to optimization and search problems.
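
The components listed above fit together in a short loop. Below is a minimal, self-contained sketch (a toy, not any canonical implementation) that evolves a bit string toward all 1s, the classic "OneMax" problem:

```python
# Minimal genetic algorithm illustrating the components listed above:
# population, fitness function, selection, crossover, and mutation.
import random

random.seed(42)
GENES, POP, GENERATIONS = 20, 30, 40

def fitness(chrom):                      # fitness function: count of 1-bits
    return sum(chrom)

def select(pop):                         # tournament selection (size 2)
    a, b = random.sample(pop, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):                   # single-point recombination
    cut = random.randrange(1, GENES)
    return p1[:cut] + p2[cut:]

def mutate(chrom, rate=0.02):            # flip each bit with small probability
    return [g ^ 1 if random.random() < rate else g for g in chrom]

pop = [[random.randint(0, 1) for _ in range(GENES)] for _ in range(POP)]
for gen in range(GENERATIONS):
    pop = [mutate(crossover(select(pop), select(pop))) for _ in range(POP)]

best = max(pop, key=fitness)
print(fitness(best), best)
```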

Well-known optimization methods

 Exact Optimization Methods:

  1. Linear Programming (LP): Optimizes a linear objective function subject to linear equality and inequality constraints.
  2. Integer Programming (IP): A type of linear programming where some or all of the decision variables are required to be integers.
  3. Quadratic Programming (QP): Optimizes a quadratic objective function subject to linear constraints.
  4. Dynamic Programming (DP): Solves problems by breaking them down into simpler subproblems and solving each subproblem just once, storing the results.
  5. Branch and Bound: An algorithm for solving integer programming problems by systematically exploring and pruning the solution space.

Gradient-Based Methods:

  1. Gradient Descent: Iteratively moves towards the minimum of a function by following the negative gradient (see the sketch after this list).
  2. Newton’s Method: Uses second-order derivative information (Hessian matrix) to find the roots of a function or the minima of a function more efficiently.
  3. Conjugate Gradient Method: An iterative method for solving large systems of linear equations and optimization problems, especially useful for quadratic functions.
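
As referenced in item 1 above, here is a minimal gradient descent sketch on a simple convex function with a known minimum:

```python
# Gradient descent on f(x, y) = (x - 3)^2 + (y + 1)^2, whose minimum is
# at (3, -1). Each step moves against the analytic gradient (2(x-3), 2(y+1)).
x, y, lr = 0.0, 0.0, 0.1
for _ in range(100):
    gx, gy = 2 * (x - 3), 2 * (y + 1)   # gradient at the current point
    x, y = x - lr * gx, y - lr * gy     # step in the negative-gradient direction
print(round(x, 4), round(y, 4))         # ~ (3.0, -1.0)
```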

Heuristic and Metaheuristic Methods:

  1. Genetic Algorithms (GA): Uses natural selection principles to evolve solutions over generations.
  2. Simulated Annealing (SA): Uses a probabilistic approach to avoid local optima by mimicking the annealing process in metallurgy (see the sketch after this list).
  3. Particle Swarm Optimization (PSO): Models the social behavior of swarms to find optimal solutions.
  4. Ant Colony Optimization (ACO): Simulates the foraging behavior of ants to find optimal paths or solutions.
  5. Tabu Search: Uses memory structures to guide the search and avoid local optima.
  6. Differential Evolution (DE): Uses mutation and recombination to evolve solutions towards the optimal.
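
And the simulated annealing sketch referenced in item 2: worse moves are occasionally accepted with probability exp(-delta/T), which is what lets the search escape local optima:

```python
# Simulated annealing on a bumpy 1-D function with several local minima.
import math
import random

random.seed(0)
f = lambda x: x * x + 10 * math.sin(x)   # objective with local minima

x, T = 5.0, 10.0
while T > 1e-3:
    cand = x + random.uniform(-1, 1)     # random neighbor of the current point
    delta = f(cand) - f(x)
    if delta < 0 or random.random() < math.exp(-delta / T):
        x = cand                         # accept improvements, and sometimes worse moves
    T *= 0.99                            # cooling schedule
print(round(x, 3), round(f(x), 3))
```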

Approximation Algorithms:

  1. Greedy Algorithms: Make a series of locally optimal choices to find a solution that is hopefully globally optimal.
  2. Local Search: Iteratively explores the neighborhood of a solution to find better solutions, such as in the case of Hill Climbing.

Stochastic and Hybrid Methods:

  1. Bayesian Optimization: Uses probabilistic models to guide the search for optimal solutions, often used for expensive-to-evaluate functions.
  2. Memetic Algorithms: Combines genetic algorithms with local search methods to refine solutions.

Swarm Optimization Algorithms:

1. Particle Swarm Optimization (PSO)
2. Ant Colony Optimization (ACO)
3. Artificial Bee Colony (ABC)
4. Grey Wolf Optimizer (GWO)
5. Firefly Algorithm (FA)

Algorithm-Agnostic Model Building with Mlflow

Agnostic = not tied to any particular algorithm

Creating generic ML pipelines using mlflow.pyfunc

https://towardsdatascience.com/algorithm-agnostic-model-building-with-mlflow-b106a5a29535
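
A minimal sketch of the article's idea, assuming any fitted estimator with a `.predict()` method (the estimator in the usage comment is illustrative):

```python
# Sketch of an algorithm-agnostic wrapper: any fitted estimator with a
# .predict() method is exposed through MLflow's generic pyfunc interface,
# so downstream code never needs to know which algorithm is inside.
import mlflow.pyfunc

class AnyModelWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, model):
        self.model = model                # e.g., a scikit-learn or XGBoost model

    def predict(self, context, model_input):
        return self.model.predict(model_input)

# Usage sketch (estimator and data names are illustrative):
# from sklearn.linear_model import LogisticRegression
# wrapped = AnyModelWrapper(LogisticRegression().fit(X, y))
# mlflow.pyfunc.log_model(artifact_path="model", python_model=wrapped)
```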

Thursday, August 8, 2024

Coercive citation

During the peer review process, or when authors have their work provisionally accepted for publication, they may encounter instances where Handling Editors or peer reviewers ask them to consider citing additional sources to ensure a more comprehensive discussion. These references may include papers published in the same journal. World Scientific strongly opposes the practice of demanding that authors include references solely to boost citation numbers without any scientific justification, commonly known as "coercive citation".

Mutated researchers

Researcher with a changing focus or area of expertise.

Alternative terms:

Cross-disciplinary researcher

Interdisciplinary researcher

Researcher with multiple areas of expertise


Mentor

 <> Mentee 

Wednesday, August 7, 2024

NLP research method

 https://kinoshita.eti.br/2017/06/03/natural-language-processing-and-natural-language-understanding.html

https://en.m.wikipedia.org/wiki/Natural_language_processing

Research methods in Natural Language Processing (NLP) have evolved from rule-based linguistics to data-driven statistical models and, most recently, to deep learning architectures. Because NLP sits at the intersection of linguistics, computer science, and statistics, its research methodologies are highly structured and iterative.

Here is an overview of the core research pipeline in modern NLP:


1. Problem Formulation and Data Collection

Research typically begins by identifying a specific task (e.g., Sentiment Analysis, Machine Translation, or Question Answering).

  • Corpus Acquisition: Gathering a large body of text. This can be from web scraping (Common Crawl), specialized datasets (Wikipedia, news archives), or proprietary domain-specific data.

  • Data Annotation: If the research involves supervised learning, human experts must label the data (e.g., tagging parts of speech or identifying "ground truth" answers).

2. Data Preprocessing (The Cleaning Phase)

Raw text is messy and must be standardized before it can be processed by a model.

  • Tokenization: Breaking sentences into individual words or sub-words.

  • Normalization: Lowercasing, removing punctuation, or "Stemming/Lemmatization" (reducing words like "running" to "run").

  • Stop-word Removal: Filtering out common words like "the" or "is" that may not carry significant semantic weight for certain tasks.
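
A minimal preprocessing pass covering these steps (the stop-word list here is a tiny illustrative subset):

```python
# Minimal preprocessing: tokenization, normalization (lowercasing,
# stripping punctuation), and stop-word removal.
import re

STOP_WORDS = {"the", "is", "on", "a", "an"}   # tiny illustrative list

def preprocess(text: str) -> list[str]:
    text = text.lower()                        # normalization
    tokens = re.findall(r"[a-z]+", text)       # simple word tokenization
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("The cat sat on the mat."))   # ['cat', 'sat', 'mat']
```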

3. Feature Engineering 

  • Feature Extraction: Creating new features from raw data (e.g., pulling "Day of the Week" from a raw "Timestamp", or using PCA to condense 100 variables into 5). In NLP this includes vectorization via Bag-of-Words, TF-IDF, or word embeddings such as word2vec. In this stage, the researcher decides which characteristics of the text are most informative for the target task.
    • Lexical Features: Extracting specific keywords, N-grams (word sequences), or morphological roots
    • Structural Features: Using Part-of-Speech (PoS) tagging or Dependency Parsing to extract the grammatical relationship between words.
    • Dimensionality Reduction: Techniques like LDA (Latent Dirichlet Allocation) might be used to extract "topics" from a large document set, reducing thousands of words to a few key themes.
  • Feature Representation: Choosing the mathematical format for those features (e.g., turning categories into binary numbers or text into dense vectors). Once features are identified, they must be represented in a format a computer can optimize (see the sketch after this list).
    • Sparse Representation: Using Bag of Words (BoW), One-Hot Encoding, or TF-IDF vectors. These are high-dimensional and "sparse" because most values are zero.
    • Dense Representation (Embeddings): Mapping extracted features into a continuous vector space (e.g., Word2Vec, GloVe). This is crucial for capturing semantic similarity, ensuring that terms like "optimization" and "efficiency" are mathematically adjacent.
    • Contextual Representation: Modern research often uses Transformers to create dynamic representations where a word's vector changes based on the surrounding text (e.g., "bank" of a river vs. investment "bank").
  • Feature Selection: Picking the most important features and discarding the noise to prevent overfitting.
  • Feature Transformation: Scaling or normalizing data (e.g., making sure a "Price" feature and an "Age" feature are on the same scale, like 0 to 1).
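
The sketch referenced above, contrasting a sparse TF-IDF representation with a dense embedding lookup (the embedding vectors here are random stand-ins for trained Word2Vec/GloVe vectors):

```python
# Sparse vs. dense representation in practice: TF-IDF vectors via
# scikit-learn, then a randomly initialized embedding table standing in
# for trained dense word vectors.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log"]

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)                 # sparse matrix, mostly zeros
print(X.shape, tfidf.get_feature_names_out())

rng = np.random.default_rng(0)
embeddings = {w: rng.normal(size=8) for w in tfidf.get_feature_names_out()}
print(embeddings["cat"][:3])                  # dense 8-dim vector per word
```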

4. Modeling

This is where the distinction between "Traditional" and "Modern" NLP is most visible.

  • Statistical/Traditional Methods: Researchers manually define features using techniques like TF-IDF (Term Frequency-Inverse Document Frequency) or N-grams. Models like Hidden Markov Models (HMM) or Support Vector Machines (SVM) are often used here.

  • Deep Learning Methods: Researchers design neural architectures (e.g., Transformers, LSTMs) that automatically learn features through layers. Current research focuses heavily on Large Language Models (LLMs) and Self-Supervised Learning, where the model learns from unlabeled text by predicting missing words.

5. Training and Optimization

  • Objective Functions: Defining a loss function (like Cross-Entropy Loss) that measures how far the model's prediction is from the truth.

  • Hyperparameter Tuning: Adjusting learning rates, batch sizes, and model depth to optimize performance.

  • Transfer Learning: Taking a pre-trained model (like BERT or GPT) and "fine-tuning" it on a specific, smaller dataset for a specialized task.

6. Evaluation Metrics

To determine if a research method is successful, researchers use standardized quantitative metrics:

  • Accuracy/F1-Score: Common for classification tasks.

  • BLEU/ROUGE: Used for translation and summarization to compare machine output against human references.

  • Perplexity: Measures how well a probability model predicts a sample.

  • Human Evaluation: For tasks like creative writing or reasoning, researchers often employ human "judges" to score the output on fluency and factual correctness.
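
For the classification metrics, a minimal example with scikit-learn on a toy prediction set:

```python
# Computing the classification metrics named above on toy predictions.
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1]
print(accuracy_score(y_true, y_pred))   # fraction of correct labels
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```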

NLP model development steps, by ChatGPT:

To develop an NLP model using the terms provided, the process generally follows these steps:

1. **Data Collection**: Gather and prepare a dataset of text that will be used for training and testing the NLP model.

2. **Tokenization**:

   - **Explanation**: Split the text into smaller units called tokens, which can be words, subwords, or characters. 

   - **Purpose**: Tokenization allows the model to process and analyze text at a granular level.

   - **Example**: "The cat sat on the mat." becomes ["The", "cat", "sat", "on", "the", "mat"].

3. **Stop Words Removal**:

   - **Explanation**: Remove common words that have little meaning on their own, such as "the," "is," and "and."

   - **Purpose**: Reduce noise in the data, focusing the model on more meaningful words.

   - **Example**: After removal, ["The", "cat", "sat", "on", "the", "mat"] might become ["cat", "sat", "mat"].

4. **Stemming**:

   - **Explanation**: Reduce words to their root form by removing suffixes (e.g., "running" → "run").

   - **Purpose**: Simplify words to a common base form, reducing vocabulary size.

   - **Example**: "running", "runner", "ran" all stem to "run".

5. **Lemmatization**:

   - **Explanation**: Similar to stemming, but it reduces words to their base or dictionary form, known as the lemma, considering the context.

   - **Purpose**: Ensure words are reduced to their meaningful base form, which may differ based on context.

   - **Example**: "better" → "good", "running" → "run".

6. **Feature Extraction**:

- **Explanation**: Convert tokens into numerical features that the model can understand.

- **Methods**:

  - **Bag of Words (BoW)**: Represents text by the frequency of words in the document. It ignores the context.

  - **TF-IDF (Term Frequency-Inverse Document Frequency)**: Adjusts the frequency of words by their importance across documents. (https://bdi.or.th/big-data-101/tf-idf-1/)

  - **Word Embedding**: An advanced method that transforms words into dense vectors capturing semantic relationships from surrounding words (i.e., context). Common methods include Word2Vec, GloVe, and FastText (see the sketch after this step).

- **Purpose**: Transform text data into a format suitable for modeling, whether through simple frequency counts or more complex vector representations.
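
The word-embedding sketch referenced above, using gensim's Word2Vec on a toy corpus (all hyperparameters are illustrative; real embeddings need far more text):

```python
# Training a tiny Word2Vec model on a toy corpus.
from gensim.models import Word2Vec

sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "log"]]
model = Word2Vec(sentences, vector_size=16, window=2, min_count=1, epochs=50)
print(model.wv["cat"][:4])                      # dense vector for "cat"
print(model.wv.most_similar("cat", topn=2))     # nearest words in vector space
```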

7. **Modeling with Deep Learning Algorithms**:

   - **Explanation**: Use deep learning techniques to build the NLP model.

   - **Purpose**: Leverage complex neural networks to capture patterns and relationships in text data.

   - **Common Models**:

     - **RNN (Recurrent Neural Network)**: Suitable for sequence-based tasks like text generation.

     - **LSTM (Long Short-Term Memory)**: An advanced form of RNN that handles long-term dependencies (see the sketch after this list).

     - **Transformer**: State-of-the-art model architecture for NLP tasks (e.g., BERT, GPT).
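
The LSTM sketch referenced in the list above: a skeleton text classifier in Keras, assuming tokenized and padded integer sequences as input (all sizes are illustrative):

```python
# Skeleton of an LSTM text classifier in Keras. Inputs are assumed to be
# integer token ids padded to a fixed length; sizes are illustrative.
from tensorflow.keras.layers import LSTM, Dense, Embedding, Input
from tensorflow.keras.models import Sequential

VOCAB, MAXLEN = 10_000, 50
model = Sequential([
    Input(shape=(MAXLEN,)),            # padded sequence of token ids
    Embedding(VOCAB, 64),              # token ids -> dense vectors
    LSTM(32),                          # sequence -> fixed-size state
    Dense(1, activation="sigmoid"),    # binary label (e.g., sentiment)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```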

8. **Model Training**:

   - **Explanation**: Train the deep learning model using the processed text data.

   - **Purpose**: Optimize model parameters to minimize error and improve accuracy.

9. **Evaluation**:

    - **Explanation**: Assess the model's performance on a validation set.

    - **Purpose**: Ensure the model generalizes well to unseen data.

10. **Deployment**:

    - **Explanation**: Integrate the trained model into a production environment.

    - **Purpose**: Make the model available for practical use.

11. **Monitoring and Maintenance**:

    - **Explanation**: Continuously monitor the model's performance and update it as needed.

    - **Purpose**: Ensure the model remains accurate and relevant over time.

Example implementation of a chatbot using LSTM

https://medium.com/@newnoi/%E0%B8%A1%E0%B8%B2%E0%B8%AA%E0%B8%A3%E0%B9%89%E0%B8%B2%E0%B8%87-chatbot-%E0%B9%81%E0%B8%9A%E0%B8%9A%E0%B9%84%E0%B8%97%E0%B8%A2%E0%B9%86-%E0%B8%94%E0%B9%89%E0%B8%A7%E0%B8%A2-machine-learning-lstm-model-%E0%B8%81%E0%B8%B1%E0%B8%99%E0%B8%94%E0%B8%B5%E0%B8%81%E0%B8%A7%E0%B9%88%E0%B8%B2-part1-6230eac8d1f8

https://medium.com/@newnoi/%E0%B8%AA%E0%B8%AD%E0%B8%99%E0%B8%84%E0%B8%AD%E0%B8%A1%E0%B8%9E%E0%B8%B9%E0%B8%94%E0%B9%81%E0%B8%9A%E0%B8%9A%E0%B9%84%E0%B8%97%E0%B8%A2%E0%B9%86-%E0%B8%94%E0%B9%89%E0%B8%A7%E0%B8%A2-machine-learning-model-part2-2a1609af1bd7