I want to show that my proposed machine learning algorithm (prop_ml) performs better than several baseline algorithms (ml_1, ml_2, ml_3) when only a **small amount of training data** is available. So far, I've split a dataset into train and test sets. Then I randomly selected **k samples** (k = 10, 20, 30, …, 100) from the train set, trained each classifier on them, and evaluated on the test set. I repeated this 5 times for each k to get more reliable results.
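
The protocol above can be sketched roughly as follows. This is a minimal illustration, not the actual experiment code: `run_protocol` and `majority_baseline` are hypothetical names, the toy data is random, and the sketch assumes every classifier is trained on the same k-sample subset within a repetition (which would make the scores paired).

```python
import random
from collections import Counter


def majority_baseline(train, test):
    """Placeholder classifier (hypothetical): predicts the most common
    training label and returns its accuracy on the test set."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return sum(label == majority for _, label in test) / len(test)


def run_protocol(train_set, test_set, classifiers,
                 ks=range(10, 101, 10), n_reps=5, seed=0):
    """For each k and each repetition, draw k training samples at random
    and score every classifier on the fixed test set.

    Assumption: all classifiers share the same subset in a given
    (k, repetition) pair, so their scores can be compared pairwise.
    """
    rng = random.Random(seed)
    results = {}
    for k in ks:
        for rep in range(n_reps):
            subset = rng.sample(train_set, k)  # k samples without replacement
            for name, fit_and_score in classifiers.items():
                results[(name, k, rep)] = fit_and_score(subset, test_set)
    return results


# Toy data: (features, label) pairs with random binary labels.
data_rng = random.Random(42)
data = [((data_rng.random(),), data_rng.randint(0, 1)) for _ in range(300)]
train_set, test_set = data[:200], data[200:]

results = run_protocol(train_set, test_set, {"majority": majority_baseline})
```

Here `results` maps `(classifier name, k, repetition)` to a test accuracy, so with 10 values of k and 5 repetitions there are 50 scores per classifier.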

Now I want to evaluate the results. Can anyone suggest a **statistical test** I could use to determine whether prop_ml is significantly better than the baselines? Thanks.