Principal Component Analysis
- Dimension reduction: by the JL lemma, $d=\Omega\left(\frac{\log n}{\epsilon^2}\right)$ dimensions preserve the pairwise distances of $n$ data points up to a factor of $1\pm\epsilon$.
- Goal of PCA 
  - maximize variance: $\mathbb{E}[(v^\top x)^2]=v^\top XX^\top v$ for $|v|=1$ (up to a $1/n$ factor, with the columns of $X$ being the centered data points)
  - minimize reconstruction error: $\mathbb{E}[|x-(v^\top x)v|^2]$
 
- Find $v_i$ iteratively, then project the data points onto the subspace spanned by $v_1,v_2,\dots,v_d$
- How to find $v$?
  - Eigendecomposition: $XX^\top =U\Sigma U^\top$
  - $v_1$ is the eigenvector with the largest eigenvalue.
  - Power method
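
The power method named above can be sketched as follows. This is a minimal illustration, not a tuned implementation: it assumes `X` has shape `(d, n)` with one centered data point per column, and the function name, iteration count, and seed are illustrative choices.

```python
import numpy as np

def top_principal_direction(X, iters=500, seed=0):
    """Power iteration on C = X X^T; X has shape (d, n), columns are data points."""
    rng = np.random.default_rng(seed)
    C = X @ X.T
    v = rng.standard_normal(C.shape[0])   # random start, assumed not orthogonal to v_1
    v /= np.linalg.norm(v)
    for _ in range(iters):
        w = C @ v                         # multiply by X X^T
        v = w / np.linalg.norm(w)         # renormalize so that |v| = 1
    eigenvalue = float(v @ C @ v)         # Rayleigh quotient v^T X X^T v
    return v, eigenvalue

# Toy data whose X X^T is diag(9, 1): the top direction is the first axis.
X_demo = np.array([[3.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
v1, lam1 = top_principal_direction(X_demo)
```

Each step multiplies by $XX^\top$ and renormalizes, so the component along the largest eigenvalue dominates as long as that eigenvalue is strictly larger than the second one.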
 
Nearest Neighbor Classification
- KNN: K-nearest neighbor
- Nearest neighbor search: locality-sensitive hashing (LSH)*
  - Randomized $c$-approximate $R$-near neighbor ($(c,R)$-NN): a data structure that, if an $R$-near neighbor of the query exists, returns some $cR$-near neighbor with some fixed probability.
  - A family $H$ is called $(R,cR,P_1,P_2)$-sensitive if for any $p,q\in \mathbb{R}^d$:
    - if $|p-q|\le R$, then $\Pr_H[h(q)=h(p)]\ge P_1$
    - if $|p-q|\ge cR$, then $\Pr_H[h(q)=h(p)]\le P_2$
    - $P_1>P_2$
 
  - Algorithm based on an LSH family:
    - Construct $g_i(x)=(h_{i,1}(x),h_{i,2}(x),\dots,h_{i,k}(x))$ for $1\le i\le L$, where all $h_{i,j}$ are drawn i.i.d. from $H$.
    - Check the elements in the buckets $g_i(q)$, testing whether each one is a $cR$-near neighbor of $q$, until $2L+1$ elements have been checked.
    - If an $R$-near neighbor exists, a $cR$-near neighbor is found with probability at least $\frac{1}{2}-\frac{1}{e}$.
    - $\rho=\frac{\log 1/P_1}{\log 1/P_2}$, $k=\log_{1/P_2}(n)$, $L=n^\rho$
- Proof
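
The construction above (tables $g_1,\dots,g_L$, then at most $2L+1$ checks) can be sketched in code. The notes do not fix a concrete family $H$, so this sketch assumes bit sampling over binary vectors with Hamming distance, where each $h_{i,j}$ reads one random coordinate. The class name `BitSamplingLSH` and its parameters are hypothetical.

```python
import random
from collections import defaultdict

def hamming(p, q):
    """Hamming distance between two equal-length 0/1 vectors."""
    return sum(a != b for a, b in zip(p, q))

class BitSamplingLSH:
    def __init__(self, points, k, L, seed=0):
        rng = random.Random(seed)
        d = len(points[0])
        self.points = points
        self.L = L
        # g_i = (h_{i,1}, ..., h_{i,k}); each h_{i,j} reads one random coordinate.
        self.coords = [[rng.randrange(d) for _ in range(k)] for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]
        for idx, p in enumerate(points):
            for i in range(L):
                self.tables[i][self._g(i, p)].append(idx)

    def _g(self, i, x):
        return tuple(x[j] for j in self.coords[i])

    def query(self, q, c, R):
        """Return the index of some cR-near neighbor of q, or None."""
        checked = 0
        for i in range(self.L):
            for idx in self.tables[i].get(self._g(i, q), []):
                checked += 1
                if hamming(self.points[idx], q) <= c * R:
                    return idx
                if checked >= 2 * self.L + 1:   # give up after 2L+1 checks
                    return None
        return None

# Toy data: one all-zeros point and five all-ones points in {0,1}^16.
points = [[0] * 16] + [[1] * 16 for _ in range(5)]
index = BitSamplingLSH(points, k=4, L=3)
nearest = index.query([0] * 16, c=2, R=1)
```

In the toy data the two groups never share a bucket whichever coordinates are sampled, so the query for the all-zeros vector returns index 0 regardless of the seed.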
 
 
Metric Learning
- Map $x_i$ to $f(x_i)$
- Hard version (compare with the labels of its neighbors)
- Soft version
- Neighborhood Components Analysis (NCA)
  - $p_{i,j}\propto \exp(-\|f(x_i)-f(x_j)\|^2)$
  - maximize $\sum_{i}\sum_{j\in C_i}p_{i,j}$, where $C_i$ is the set of points sharing the label of $x_i$
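
The NCA objective can be evaluated as in the sketch below. It assumes a linear map $f(x)=Ax$ and excludes $j=i$ from the normalization; both are common choices that the notes do not spell out. The name `nca_objective` is illustrative.

```python
import numpy as np

def nca_objective(A, X, y):
    """Sum over i of the probability that x_i picks a neighbor with the same label.

    A: (m, d) linear map (assumption: f(x) = A x), X: (n, d) data, y: (n,) labels.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    Z = X @ np.asarray(A, dtype=float).T                # rows are f(x_i)
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    logits = -sq
    np.fill_diagonal(logits, -np.inf)                   # a point is not its own neighbor
    logits -= logits.max(axis=1, keepdims=True)         # stabilize the softmax
    P = np.exp(logits)
    P /= P.sum(axis=1, keepdims=True)                   # p_{i,j}, each row sums to 1
    same_label = y[:, None] == y[None, :]
    return float(np.sum(P * same_label))

X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
y_demo = np.array([0, 0, 1, 1])
score = nca_objective(np.eye(2), X_demo, y_demo)        # close to 4: classes are well separated
```

The objective lies between 0 and $n$. Gradient ascent on $A$ (not shown) would push it toward $n$.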
 
- LMNN: $L=\max(0,\|f(x)-f(x^+)\|_2-\|f(x)-f(x^-)\|_2+r)$
  - $x^+$ (same label) and $x^-$ (different label) are chosen as the worst cases for $x$.
  - $r$ is the margin.
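
A small numerical sketch of this hinge loss follows. It reads "worst case" as the farthest same-label point and the nearest different-label point, which is an assumption about the notes. The inputs are taken to be already-embedded vectors $f(\cdot)$, and the function names are illustrative.

```python
import numpy as np

def triplet_hinge_loss(fx, f_pos, f_neg, r=1.0):
    """max(0, ||f(x) - f(x+)||_2 - ||f(x) - f(x-)||_2 + r) for one triplet."""
    d_pos = np.linalg.norm(np.asarray(fx, float) - np.asarray(f_pos, float))
    d_neg = np.linalg.norm(np.asarray(fx, float) - np.asarray(f_neg, float))
    return float(max(0.0, d_pos - d_neg + r))

def worst_case_triplet_loss(fx, positives, negatives, r=1.0):
    """Same loss with x+ the farthest positive and x- the nearest negative (assumed reading)."""
    fx = np.asarray(fx, float)
    d_pos = np.linalg.norm(np.asarray(positives, float) - fx, axis=1).max()
    d_neg = np.linalg.norm(np.asarray(negatives, float) - fx, axis=1).min()
    return float(max(0.0, d_pos - d_neg + r))

# Positive at distance 5, negative at distance 1, margin 1: loss = 5 - 1 + 1 = 5.
loss = triplet_hinge_loss([0, 0], [3, 4], [0, 1], r=1.0)
```

The loss is zero exactly when the negative is at least $r$ farther from $f(x)$ than the positive.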
 
Spectral Clustering
- K-means
- Spectral graph clustering
  - Graph Laplacian: $L=D-A$, where $A$ represents the similarity and $D$ is the diagonal degree matrix.
  - The number of zero eigenvalues equals the number of connected components.
  - The eigenvectors of the $k$ smallest eigenvalues give a partition into $k$ clusters: run $k$-means on the rows.
  - Ratio cut can be relaxed into finding the eigenvectors of the $k$ smallest eigenvalues of the graph Laplacian, i.e. the same problem.
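
The Laplacian step can be checked on a toy graph. The sketch below assumes an unnormalized Laplacian and a graph made of two disconnected triangles. The final $k$-means step is omitted because the embedded rows already coincide within each component here. The name `laplacian_embedding` is illustrative.

```python
import numpy as np

def laplacian_embedding(A, k):
    """Eigenvalues of L = D - A and the eigenvectors of its k smallest eigenvalues."""
    A = np.asarray(A, dtype=float)
    D = np.diag(A.sum(axis=1))            # degree matrix
    L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)  # symmetric input, eigenvalues ascending
    return eigvals, eigvecs[:, :k]

# Two disconnected triangles: nodes {0, 1, 2} and {3, 4, 5}.
A_demo = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A_demo[i, j] = A_demo[j, i] = 1.0

eigvals, U = laplacian_embedding(A_demo, k=2)
num_components = int(np.sum(np.abs(eigvals) < 1e-9))   # number of zero eigenvalues
```

Row `U[i]` is the embedding of node `i`. On a connected but clusterable graph the rows only concentrate around $k$ centers, which is where $k$-means comes in.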
 
SimCLR*
-  Intelligence is positioning 
-  InfoNCE loss 
   $L(q,p_1,\{p_i\}_{i=2}^N)=-\log \frac{\exp(-\|f(q)-f(p_1)\|^2/(2\tau))}{\sum_{i=1}^{N}\exp(-\|f(q)-f(p_{i})\|^2/(2\tau))}$
-  Learn $Z=f(x)$: map the original data points into a space in which semantic similarity is captured naturally.
    - Reproducing kernel Hilbert space: $k(f(x_1),f(x_2))=\langle\phi(f(x_1)),\phi(f(x_2))\rangle_H$. The inner product of the feature maps is a kernel function.
    - Usually $K_{Z,i,j}=k(Z_i-Z_j)$, where $k$ is Gaussian.
 
-  We are given a similarity matrix $\pi$ on the dataset in advance, where $\pi_{i,j}$ is the similarity of data points $i$ and $j$. We want the similarity matrix $K_Z$ of $f(x)$ to match the manually specified similarity of $x$. Let $W_X\sim \pi$ and $W_Z\sim K_Z$; we want these two samples to be the same.
-  Minimize the cross-entropy loss: $H_{\pi}^{k}(Z)=-\mathbb{E}_{W_X\sim P(\cdot ;\pi)}[\log P(W_Z=W_X;K_Z)]$
    - Equivalent to the InfoNCE loss: restricted to row $i$, the InfoNCE loss is $-\log P(W_{Z,i}=W_{X,i};K_Z)$. The given pair $q,p_1$ is sampled from the similarity matrix $\pi$, which corresponds to $W_X\sim P(\cdot;\pi)$.
    - Equivalent to spectral clustering: equivalent to $\arg\min_Z \operatorname{tr}(Z^\top L^*Z)$
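
The InfoNCE loss with the Gaussian-style similarity used in this section can be computed as in the sketch below. It assumes the embeddings $f(q)$ and $f(p_i)$ are already available as arrays, with the positive $p_1$ stored in row 0. The function name `info_nce` and the default `tau` are illustrative.

```python
import numpy as np

def info_nce(fq, fp, tau=0.5):
    """-log( exp(-|f(q)-f(p_1)|^2 / (2 tau)) / sum_i exp(-|f(q)-f(p_i)|^2 / (2 tau)) ).

    fq: (m,) embedding of the query; fp: (N, m) embeddings, row 0 is the positive p_1.
    """
    fq = np.asarray(fq, dtype=float)
    fp = np.asarray(fp, dtype=float)
    logits = -np.sum((fp - fq) ** 2, axis=1) / (2.0 * tau)
    m = logits.max()
    log_denominator = m + np.log(np.sum(np.exp(logits - m)))   # stable log-sum-exp
    return float(log_denominator - logits[0])

# If all N candidates are equally far from q, the loss equals log N.
uniform_loss = info_nce([0.0, 0.0], [[1.0, 1.0]] * 4)
```

The loss approaches 0 when the positive sits on top of the query and every negative is far away.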
 
t-SNE
-  Data visualization: map the data into a low-dimensional space (2D)
-  SNE: same as NCA, we want $q_{i,j}\propto \exp(-\|f(x_i)-f(x_j)\|^2/(2\sigma^2))$ to be similar to $p_{i,j}\propto \exp (-\|x_i-x_j\|^2/(2\sigma_i^2))$
    - Cross-entropy loss: $-p_{i,j}\cdot \log \frac{q_{i,j}}{p_{i,j}}$, summed over pairs $(i,j)$. This is the KL divergence from $p$ to $q$, which differs from the cross-entropy only by the entropy of $p$, a constant.
 
-  Crowding problem 
-  Solved by t-SNE: let $q_{i,j}\propto (1+\|y_j-y_i\|^2)^{-1}$ (Student t-distribution), where $y_i=f(x_i)$
    - The power $-1$ gives a heavier tail than the Gaussian, so moderately distant points can be placed farther apart in the map, which relieves the crowding problem.
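
The two similarity models and the loss can be compared numerically. This sketch makes simplifying assumptions that are not in the notes: a single bandwidth `sigma` instead of per-point $\sigma_i$, and normalization over all pairs $i\neq j$. The function names are illustrative.

```python
import numpy as np

def pairwise_sq_dists(Z):
    Z = np.asarray(Z, dtype=float)
    diff = Z[:, None, :] - Z[None, :, :]
    return np.sum(diff ** 2, axis=-1)

def gaussian_affinities(X, sigma=1.0):
    """p_{i,j} proportional to exp(-|x_i - x_j|^2 / (2 sigma^2)), zero on the diagonal."""
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    return P / P.sum()

def student_t_affinities(Y):
    """q_{i,j} proportional to (1 + |y_i - y_j|^2)^(-1), zero on the diagonal."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(Y))
    np.fill_diagonal(Q, 0.0)
    return Q / Q.sum()

def kl_divergence(P, Q):
    """Sum over pairs of p_{i,j} * log(p_{i,j} / q_{i,j}), skipping zero entries of P."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / Q[mask])))

Y_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
Q_demo = student_t_affinities(Y_demo)
loss = kl_divergence(gaussian_affinities(Y_demo), Q_demo)
```

At squared distance 25 the unnormalized Gaussian weight with $\sigma=1$ is $e^{-12.5}$ while the Student t weight is $1/26$, which illustrates the heavier tail.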
 









