diff --git a/tutorials.md b/tutorials.md
new file mode 100644
index 0000000..a559f74
--- /dev/null
+++ b/tutorials.md
@@ -0,0 +1,51 @@
+# Tutorials
+
+## Week 1 tutorial
+
+- Calculation: (formulas are given in the test paper)
+  - $Accuracy = \frac{\text{Correct Classifications}}{\text{Total Classifications}} = \frac{TP + TN}{TP + TN + FP + FN}$
+  - $F1 = \frac{2}{recall^{-1} + precision^{-1}} = \frac{2 \times TP}{2 \times TP + FP + FN}$
+- Accuracy vs. F1:
+  - Accuracy: counts correct predictions (TP and TN); suits balanced classes
+  - F1: penalizes FP and FN and ignores TN; used for imbalanced classes
+
+## Week 2 tutorial
+
+- IQR: difference between the 75th percentile (Q3) and the 25th percentile (Q1) of a dataset, $IQR = Q3 - Q1$
+  - The spread of the middle 50% of values
+  - Popular method of flagging outlying observations:
+    - Find the median, Q1, Q3, upper bound, and lower bound
+  - Method: https://www.scribbr.com/statistics/interquartile-range/
+
+## Week 3 tutorial
+
+- K-means clustering:
+  - Choose K, the number of clusters
+  - Pick K random points as the initial centroids
+  - Assign each data point to its closest centroid
+  - Calculate the mean of each cluster and move its centroid there (the
+    centroid doesn't have to be on a data point)
+  - Repeat the last two steps until the centroids no longer change
+
+## Week 4 tutorial
+
+- Euclidean distance
+- Cosine similarity
+  - Useful for applications with sparse data, since even if the objects are
+    far apart in Euclidean distance, they can still have a small angle between them.
+    - Word documents (NLP)
+    - Market transaction data
+    - Recommendation systems
+    - Digital images
+  - Suits sparse data because 0-0 matches (attributes that are zero in both vectors) are ignored
+  - Values:
+    - Cos close to 1: similar
+    - Cos close to 0: orthogonal, not related
+    - Cos close to -1: opposite
+  - Calculation:
+    $Similarity(A,B) = \cos(\theta) = \frac{A \cdot B}{\|A\| \times \|B\|}$
+    - $\theta$ is the angle between the vectors
+    - $A \cdot B$ is the dot product, $A_1 B_1 + A_2 B_2 + \dots + A_n B_n$
+    - $\|A\|$ is the magnitude of the vector,
+      $\sqrt{A^2_1 + A^2_2 + \dots + A^2_n}$
+    - Calculate the angle with $\theta = \arccos(Similarity(A,B))$
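
A note on the Week 4 cosine similarity calculation: the formula can be checked with a few lines of plain Python. This is only a minimal sketch, not part of the tutorial material. The function names `cosine_similarity` and `angle_between` and the sample vectors are hypothetical, and rejecting zero vectors is an assumption made here because the formula divides by the magnitudes.

```python
import math


def cosine_similarity(a, b):
    """Return cos(theta) for two equal-length numeric vectors (hypothetical helper)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Dot product: A_1*B_1 + A_2*B_2 + ... + A_n*B_n
    dot = sum(x * y for x, y in zip(a, b))
    # Magnitudes: sqrt(A_1^2 + A_2^2 + ... + A_n^2)
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Assumption: treat zero vectors as an error, since the ratio is undefined
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def angle_between(a, b):
    """Return the angle theta in radians, i.e. arccos of the similarity."""
    # Clamp to [-1, 1] so floating-point rounding cannot break acos
    cos_theta = max(-1.0, min(1.0, cosine_similarity(a, b)))
    return math.acos(cos_theta)


# Illustrative sparse vectors (made up): same direction, different magnitude
doc_a = [3, 0, 0, 2, 0]
doc_b = [6, 0, 0, 4, 0]
print(cosine_similarity(doc_a, doc_b))
print(angle_between([1, 0], [0, 1]))
```

The two sample vectors are far apart in Euclidean distance but point in the same direction, so their similarity comes out at (approximately) 1. Positions where both vectors hold 0 add nothing to the dot product or to either magnitude, which is why 0-0 matches are effectively ignored.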