Friday, May 23, 2025

Data visualization tools

Power BI (Microsoft)

Tableau

Looker Studio (Google): https://lookerstudio.google.com/u/0/navigation/reporting

More at https://dataforest.ai/blog/best-business-intelligence-tools

K-parametric clustering algorithms vs Nonparametric clustering algorithms

Examples of K-parametric clustering algorithms:

  • K-Means: performs a heuristic optimization to minimize its objective function, the within-cluster sum of squares (WCSS), also known as inertia, i.e., the sum of squared Euclidean distances between each point and its cluster centroid.
  • K-Medoids (PAM)
  • Gaussian Mixture Models (GMMs) – though they model distributions, you still need to define k
  • Spectral Clustering – often requires k for the number of eigenvectors/clusters
  • 1D K-Means with dynamic programming: the objective function is the within-cluster sum of squares, just as in K-Means, but the one-dimensional structure admits an exact, globally optimal solution (https://medium.com/@andreys95/optimal-1d-k-means-with-dynamic-programming-4d6ff57b6244). A sketch follows the application list below. Its applications are:


    1. Image and Signal Processing

    • Edge detection in 1D signals: Identifying abrupt changes in intensity, such as in electrocardiogram (ECG) signals or sound waveforms.
    • Grayscale image analysis: Clustering pixel intensities for thresholding and segmentation (e.g., Otsu’s method).

    2. Anomaly Detection

    • Outlier detection: Identifying unusual data points in a sequence, such as in temperature logs, stock prices, or sensor readings.
    • Network intrusion detection: Anomalous traffic volumes or latencies can be flagged using 1D clustering.

    3. Finance and Economics

    • Price segmentation: Grouping stock prices, customer expenditures, or transaction amounts into clusters for analysis or marketing.
    • Economic indicator binning: Simplifying complex metrics like inflation rates or GDP growth into meaningful ranges.

    4. Healthcare and Medicine

    • Vital sign monitoring: Clustering heartbeat intervals, glucose levels, or other biometric time series to identify normal vs. abnormal ranges.
    • Dosage grouping: Categorizing drug dosages for different patient groups or treatment levels.

    5. Industrial and IoT Applications

    • Sensor data clustering: Classifying temperature, vibration, or pressure readings for predictive maintenance.
    • Energy usage analysis: Segmenting power consumption values to optimize resource distribution.

    6. Education and Testing

    • Score grading: Clustering test scores to assign grades or identify performance bands.
    • Learning analytics: Grouping students by time spent or attempts on a quiz for intervention strategies.

    7. Natural Language Processing (NLP)

    • Word length or frequency clustering: Used in stylometric analysis or feature engineering in text mining.

    8. Retail and Marketing

    • Customer segmentation: Based on a single metric like frequency of purchase or average order value.
    • Pricing strategy: Grouping products by their price points for tiered marketing approaches.
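Since the one-dimensional case admits an exact solution, here is a minimal sketch of optimal 1D K-Means via dynamic programming, as referenced above (an O(k·n²) variant written for clarity rather than speed; all names are illustrative):

  import numpy as np

  def kmeans_1d_dp(x, k):
      """Exact 1D k-means: minimal WCSS over all partitions of the sorted data."""
      x = np.sort(np.asarray(x, dtype=float))
      n = len(x)
      # Prefix sums give the WCSS of any contiguous segment in O(1).
      p1 = np.concatenate(([0.0], np.cumsum(x)))
      p2 = np.concatenate(([0.0], np.cumsum(x * x)))

      def sse(i, j):  # WCSS of one cluster covering x[i..j], inclusive
          s, s2, m = p1[j + 1] - p1[i], p2[j + 1] - p2[i], j - i + 1
          return s2 - s * s / m

      # D[c][i]: minimal WCSS of splitting x[0..i] into c clusters.
      D = np.full((k + 1, n), np.inf)
      for i in range(n):
          D[1][i] = sse(0, i)
      for c in range(2, k + 1):
          for i in range(c - 1, n):
              # The last cluster is x[j..i]; try every feasible start j.
              D[c][i] = min(D[c - 1][j - 1] + sse(j, i) for j in range(c - 1, i + 1))
      return D[k][n - 1]  # the globally optimal WCSS

  print(kmeans_1d_dp([1.0, 1.1, 5.0, 5.2, 9.9], k=2))  # ~= 15.385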


Examples of Nonparametric clustering algorithms:

  • DBSCAN – defines clusters based on density, not a fixed k
  • OPTICS – an extension of DBSCAN, good for varying densities
  • Mean Shift – mode-seeking algorithm that finds clusters around data density peaks
  • Hierarchical Clustering – builds a dendrogram that can be cut at any level to form clusters
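For contrast, a tiny DBSCAN example with scikit-learn (parameter values are illustrative): no k is given; clusters emerge from density, and outliers are labeled -1.

  import numpy as np
  from sklearn.cluster import DBSCAN

  rng = np.random.default_rng(0)
  X = np.vstack([rng.normal(0, 0.2, (50, 2)),   # dense blob around (0, 0)
                 rng.normal(5, 0.2, (50, 2)),   # dense blob around (5, 5)
                 [[10.0, 10.0]]])               # one isolated outlier

  labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
  print(sorted(set(labels)))  # [-1, 0, 1]: two clusters plus noise, no k specified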
=========
K-Means++ is a smarter way to initialize centroids for the K-Means algorithm. It improves both the accuracy and stability of clustering by reducing the chance of converging to poor local minima. Standard K-Means picks initial centroids at random, which can lead to bad clusterings (poor local optima) and require multiple restarts to get good results. K-Means++ Initialization Steps:
  1. Randomly select the first centroid \mu_1 from the dataset.
  2. For each data point x, compute the squared distance D(x)^2 to the nearest already chosen centroid.
  3. Select the next centroid with probability:
    P(x) = \frac{D(x)^2}{\sum_{x' \in X} D(x')^2}
    → This favors points far from existing centroids.
  4. Repeat steps 2–3 until k centroids are chosen.
  5. Run standard K-Means using these initialized centroids.

Example:

  • You’ve selected 1 centroid: \mu_1
  • You have 5 data points with distances to \mu_1:
    D(x_1)^2 = 1,\quad D(x_2)^2 = 4,\quad D(x_3)^2 = 9,\quad D(x_4)^2 = 16,\quad D(x_5)^2 = 0.25
  • Total = 1 + 4 + 9 + 16 + 0.25 = 30.25

Then the probability of picking x_4 as the next centroid is: P(x_4) = \frac{16}{30.25} \approx 0.529
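A minimal NumPy sketch of these seeding steps (illustrative only; in practice scikit-learn's KMeans already uses init='k-means++' by default):

  import numpy as np

  def kmeans_pp_init(X, k, seed=0):
      """K-Means++ seeding: pick each new centroid with probability proportional to D(x)^2."""
      rng = np.random.default_rng(seed)
      n = X.shape[0]
      centroids = [X[rng.integers(n)]]              # step 1: uniform first pick
      while len(centroids) < k:
          # step 2: squared distance from each point to its nearest centroid
          d2 = ((X[:, None, :] - np.asarray(centroids)[None, :, :]) ** 2).sum(-1).min(1)
          p = d2 / d2.sum()                         # step 3: sampling distribution
          centroids.append(X[rng.choice(n, p=p)])   # favors far-away points
      return np.array(centroids)                    # step 5: feed to standard K-Means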

Ph.D. vs D.Eng.

The Ph.D. (Doctor of Philosophy) and the D.Eng. (Doctor of Engineering) are both doctoral-level degrees, but they differ primarily in focus, purpose, and career trajectory:

1. Focus and Purpose

Ph.D.:

Focuses on original theoretical research.

Aims to contribute new knowledge to a field.

Often prepares candidates for academic careers (professorships, research institutes).


D.Eng. (or Eng.D.):

Emphasizes applied research and practical problem-solving in engineering contexts.

Designed to be more relevant to industry than academia.

Often involves collaboration with companies or government agencies.


2. Dissertation Style

Ph.D.:

Usually results in a highly theoretical dissertation.

Often includes formal models, proofs, or simulations with theoretical insights.

D.Eng.:

Typically results in a practical engineering project or case study, with real-world implementation.

May include prototype development, system design, or applied innovations.

AlphaEvolve

Generative AI vs Analytic AI

AI winters

Turing test

[Figure: Turing test setup]

In the test, a human and a computer (on the left of the figure) converse with an interrogator (on the right of the figure).

The test is passed only when the interrogator cannot tell whether they are talking to a human or a computer.

Thursday, May 22, 2025

Agentic coding

Refers to coding with AI assistance, where AI tools generate, complete, or refactor code. Tools include:

  • ChatGPT 
  • GitHub Copilot https://github.com/features/copilot
  • Amazon CodeWhisperer https://docs.aws.amazon.com/codewhisperer/latest/userguide/what-is-cwspr.html

Preprocessing VS Removing Noisy Features

When training an ML model, we can do data cleansing. But during real deployment, how do we cope with noisy data? Key strategies for coping with noisy data during deployment:


1. Preprocessing Pipeline in Production

Build a real-time data preprocessing pipeline similar to the one used during training. This may include:

  • Normalization/standardization

  • Missing value imputation

  • Outlier filtering

  • Text/token cleanup (e.g., lowercasing, removing symbols)
    Use the same logic and codebase (or serialize transformers like sklearn's scalers, spaCy pipelines, or TensorFlow preprocessing layers).
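For example, with scikit-learn one common pattern is to fit the preprocessing once and serialize it, so training and serving share identical logic (a sketch; X_train, X_raw, and the file name are placeholders):

  import joblib
  from sklearn.impute import SimpleImputer
  from sklearn.pipeline import Pipeline
  from sklearn.preprocessing import StandardScaler

  # At training time: fit and persist the exact preprocessing used for the model.
  prep = Pipeline([("impute", SimpleImputer(strategy="median")),
                   ("scale", StandardScaler())])
  prep.fit(X_train)                        # X_train: your training features
  joblib.dump(prep, "preprocess.joblib")

  # In production: load and apply the very same transformers.
  prep = joblib.load("preprocess.joblib")
  X_clean = prep.transform(X_raw)          # X_raw: incoming, possibly noisy data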


2. Robust Model Training

Train your model to be robust against noise:

  • Add noise (augmentation) during training to simulate real-world scenarios (e.g., drop words, add typos, jitter numeric features).

  • Use regularization techniques (e.g., dropout, L2) to prevent overfitting on overly clean data.
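A small augmentation sketch for numeric features and text (rates and function names are illustrative):

  import numpy as np

  rng = np.random.default_rng(0)

  def jitter(X, scale=0.01):
      """Add small Gaussian noise to numeric features during training."""
      return X + rng.normal(0.0, scale, X.shape)

  def drop_words(text, rate=0.1):
      """Randomly drop words to simulate noisy real-world text."""
      return " ".join(w for w in text.split() if rng.random() > rate)

  print(drop_words("the quick brown fox jumps over the lazy dog"))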


3. Confidence Thresholds

Use prediction confidence scores (from softmax or probability outputs) to:

  • Reject uncertain predictions

  • Flag them for human review or fallback systems
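A sketch of such a gate, assuming a fitted classifier `model` exposing predict_proba (the 0.8 threshold is illustrative):

  import numpy as np

  proba = model.predict_proba(X_clean)   # model, X_clean: placeholders from above
  conf = proba.max(axis=1)               # confidence of the top class
  pred = proba.argmax(axis=1)
  for i in np.where(conf < 0.8)[0]:      # below threshold: don't trust the model
      print(f"row {i}: confidence {conf[i]:.2f} -> flagged for human review")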


4. Input Validation & Sanity Checks

Before feeding data to the model, validate input:

  • Reject ill-formed entries (e.g., empty strings, NaNs)

  • Ensure values fall within expected ranges or categories

  • Log or alert on anomalies
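A minimal validation sketch with pandas (the schema and column name are hypothetical):

  import pandas as pd

  EXPECTED_RANGE = {"temperature_c": (-40.0, 85.0)}   # hypothetical schema

  def validate(df: pd.DataFrame) -> pd.DataFrame:
      """Drop ill-formed rows and warn on anomalies before inference."""
      df = df.dropna()                                # reject NaNs / empty fields
      for col, (lo, hi) in EXPECTED_RANGE.items():
          bad = ~df[col].between(lo, hi)
          if bad.any():
              print(f"warning: {bad.sum()} out-of-range values in {col}")
          df = df[~bad]
      return df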


5. Ensemble Models or Rule-based Fallback

  • Use ensemble models (e.g., majority vote, stacking) that tend to be more stable with noisy input.

  • Add rule-based systems as a fallback for edge cases (e.g., if input is incomplete or invalid).


6. Online Monitoring & Feedback Loop

  • Monitor input quality and model performance in production

  • Detect concept drift or change in noise patterns

  • Use logs to retrain/update the model periodically


7. Data Denoising Models

In high-noise environments, deploy a denoising model or filter before the main model:

  • For images: autoencoders, image filters

  • For text: typo correction, spell checking

  • For time series: smoothing, Kalman filters
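As a simple time-series illustration, a moving-average filter applied before the main model (window size is illustrative; Kalman filters and autoencoders follow the same "denoise first, then predict" idea):

  import numpy as np

  def moving_average(x, window=5):
      """Smooth a noisy 1D series before feeding it to the main model."""
      kernel = np.ones(window) / window
      return np.convolve(x, kernel, mode="same")

  rng = np.random.default_rng(0)
  noisy = np.sin(np.linspace(0, 6, 200)) + rng.normal(0, 0.3, 200)
  smooth = moving_average(noisy)          # reduced high-frequency noise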

So, if a feature is noisy and doesn't contribute useful signal, then it's often better to remove it.
--ChatGPT

Wednesday, May 21, 2025

Boom timeline of AI services

Deep learning has brought AI back into the spotlight.
  1. Face-recognition time clocks
  2. Siri & Alexa (voice-based AI)
  3. AlphaGo
  4. ChatGPT (text-based AI) (Big 5: OpenAI, Google, Meta, Microsoft, Amazon)

Mentoring research outside one's direct expertise

 Give guidance based on metaheuristic research methodology (rather than field-specific heuristics).

Tuesday, May 20, 2025

How to compare algorithms?

 In case there are too many prior algorithms, what should you do?

Option 1: If you are lucky and all of them use the same benchmark dataset, then you can compare your proposed algorithm using that benchmark.

Option 2: Select representative algorithms as baselines. The baselines should include well-known, state-of-the-art, and top-performing methods. Importantly, the selected baselines should cover methodologically different styles or strategies. Then conduct extensive experiments with various evaluation metrics.

Option 3: Determine the formal global optimum and compare your approach against it. If your approach reaches the optimum, there is no need to compare with local-optimum algorithms at all. This is a very strong and elegant strategy when applicable. Use the following metrics (a tiny helper for the first one is sketched after this list):

1. Gap to optimum

2. Time to reach optimum

3. Stability over multiple runs (for stochastic algorithms)
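For instance, the first metric is just a relative difference (a trivial helper; names are illustrative):

  def gap_to_optimum(obj_value, opt_value):
      """Relative gap (%) between an algorithm's objective and the known global optimum."""
      return 100.0 * abs(obj_value - opt_value) / abs(opt_value)

  print(gap_to_optimum(102.0, 100.0))  # 2.0, i.e., within a 2% gap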

If your algorithm consistently reaches or nearly reaches the global optimum:

That's clear evidence that algorithms that can get stuck in local optima (like greedy, GA, PSO, etc.) are unnecessary for comparison. You can claim your algorithm is globally optimal or near-optimal in practice. You can skip comparing with heuristic/metaheuristic baselines if:

Your algorithm reaches the global optimum in all test cases, or

It comes within a very tight tolerance (say, ≤1%) and is significantly faster.

This not only saves space and time in your paper, but also strengthens your scientific rigor, since you base your results on a provable benchmark.

Monday, May 19, 2025

Scikit-learn Machine Learning in Python

 https://scikit-learn.org/stable/

Thursday, May 15, 2025

Norm-referenced VS Criterion-based Achievement Grading

Norm-referenced method: When the exam only samples part of the content (due to time constraints, for instance), it may not fully reflect all students' knowledge or abilities. In such cases, norm-referencing helps distinguish performance levels relative to peers, especially if the exam is designed to be difficult or selective.

Criterion-based method: If the exam fully aligns with the course objectives and content, it's appropriate to assess students based on predetermined criteria. In this case, every student theoretically has an equal opportunity to succeed by demonstrating mastery of the material.

===

📌 Why Use Norm-Referenced Grading When an Exam Has Incomplete Content Coverage?

It's not that incomplete content coverage requires norm-referenced grading, but rather that norm-referenced grading can be more practical or justifiable in that situation. Here’s the reasoning:


🔍 1. Incomplete Coverage = Limited Validity for Mastery Judgments

  • In criterion-based grading, you’re judging whether students meet specific learning outcomes.

  • But if the test only covers part of what was taught, you can’t be sure a student has mastered all intended outcomes — the exam doesn’t measure them all.

  • This makes it difficult to fairly say “Student A met the standard” if the test didn’t assess the full standard.

✅ So: If content coverage is partial, a claim like “meets expectations” (criterion-based) is less valid.


🔍 2. Norm-Referencing Focuses on Ranking, Not Mastery

  • Norm-referenced grading doesn't claim to assess full mastery — it just compares students to each other.

  • If everyone is tested on the same (even partial) content, you can still rank performance fairly.

  • This is especially common in competitive settings like entrance exams, where the goal is to identify the top X%.

✅ So: Even if the content is partial, norm-referenced grading can still say, “Student A performed better than 85% of peers.”


🔍 3. Selective Testing is Often Meant to Differentiate, Not to Measure Everything

  • Some exams (especially in large-scale or competitive environments) are designed to be selective, challenging, and not to reflect all course content.

  • In these cases, norm-referencing is deliberate — the test's role is to discriminate between levels of performance, not verify full learning.


📘 Example to Illustrate

Imagine a computer science course with 10 learning objectives:

  • Scenario A (Criterion-Based Fit): The final exam has one question for each objective, with rubrics. You can say “Students mastered 8 out of 10 objectives.”

  • Scenario B (Norm-Referenced Fit): The exam covers only 4 objectives in depth (due to time constraints), with high difficulty and trick questions. You can’t judge overall mastery, but you can still say “Student A is in the top 10%.”

--ChatGPT

Wednesday, May 14, 2025

WCSS VS DBI

The Within-Cluster Sum of Squares (WCSS) is a measure used to evaluate the compactness of clusters in a clustering algorithm, such as k-means. It calculates the sum of squared distances between each data point and the centroid (mean) of the cluster it belongs to.

Mathematically, for k clusters C_1, \dots, C_k with centroids \mu_1, \dots, \mu_k, the WCSS is expressed as:

\mathrm{WCSS} = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2
WCSS (Within-Cluster Sum of Squares) and DBI (Davies–Bouldin Index) are both metrics used to evaluate clustering performance, but they focus on different aspects:

---

Objective:

  • WCSS: measures compactness of clusters (intra-cluster similarity).

  • DBI: balances compactness and separation between clusters (inter-cluster dissimilarity).

Key Differences:

  • WCSS looks only within clusters.

  • DBI looks both within and between clusters.

A small comparison on synthetic data is sketched below.
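A quick illustration on synthetic blobs, computing both metrics across candidate k (the dataset and the range of k are illustrative):

  from sklearn.cluster import KMeans
  from sklearn.datasets import make_blobs
  from sklearn.metrics import davies_bouldin_score

  X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
  for k in (2, 3, 4, 5):
      km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
      # inertia_ is the WCSS (within clusters only); DBI also rewards separation.
      print(k, round(km.inertia_, 1), round(davies_bouldin_score(X, km.labels_), 3))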

Data anonymization

  • Processing personal data so that it can continue to generate business value without revealing individuals' identities
  • Pseudonymization vs Anonymization

Pseudonymization

    • Definition: Personal data is replaced with pseudonyms (e.g., codes, numbers) but can still be re-identified using additional information (e.g., a key).

    • Reversibility: Reversible – the original data can be restored if the pseudonym and key are combined.

    • Purpose: Reduces risks during data processing, storage, or sharing, while still allowing for re-identification when necessary (e.g., in medical research).

    • Example: Replacing patient names with IDs in a health database, while keeping a separate file that links IDs to names.

    • GDPR Status: Still considered personal data, but offers some compliance benefits if implemented correctly.

Anonymization

    • Definition: Personal data is irreversibly altered so that the individual can no longer be identified, directly or indirectly.

    • Reversibility: Irreversible – the data cannot be traced back to a person.

    • Purpose: Used when there's no need to identify individuals, such as for open data publication or aggregate analysis.

    • Example: Aggregating survey results so that individual responses cannot be linked to specific participants.

    • GDPR Status: Not considered personal data – once data is truly anonymized, GDPR no longer applies.
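One common pseudonymization technique is keyed hashing: the key is stored separately from the data, and records can be re-linked to identities only by recomputing pseudonyms with that key (a minimal sketch; the key and name are placeholders). True anonymization would discard any such key and, for example, publish only aggregates.

  import hashlib
  import hmac

  SECRET_KEY = b"store-this-key-separately"   # placeholder; never ship with the data

  def pseudonymize(name: str) -> str:
      """Stable pseudonym; re-linkable only by whoever holds SECRET_KEY."""
      return hmac.new(SECRET_KEY, name.encode(), hashlib.sha256).hexdigest()[:12]

  print(pseudonymize("Alice"))  # same input always yields the same pseudonym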

Tuesday, May 13, 2025

Thailand's Giga Data Center Initiative

Thailand has announced plans to invest approximately 170 billion baht (around USD 4.7 billion) to establish itself as a Giga Data Hub in the ASEAN region. This initiative is a collaboration between the Thai government, the Charoen Pokphand Group (CP Group), and global investment firm Global Infrastructure Partners (GIP). The goal is to develop advanced data center infrastructure, positioning Thailand as a central hub for data services in Southeast Asia.