Project 3 – Ensemble Methods and Unsupervised Learning

In this project you will explore some techniques in unsupervised learning as well as ensemble methods. It is important to realize that understanding an algorithm or technique requires understanding how it behaves under a variety of circumstances. You will go through the process of choosing and exploring two classification datasets, tuning the algorithms you have learned about, writing a thorough analysis of your findings, and presenting your findings. The most crucial part of this assignment is the analysis and your ability to explain and justify your results.

 

I. Choosing Datasets

 

The first task in this assignment is choosing two interesting classification datasets; these can be binary or multiclass. The features can be of any type, and it is recommended that you choose datasets with diverse feature sets. I don’t care where you get the data from. You can download some, take some from your own research, or make some up on your own. What I do care about is that the datasets are interesting. They should contain a decent number of features and a sufficiently large number of examples. Do not choose an “easy” dataset, but don’t go crazy trying to find the perfect one either. Your two datasets should also differ in some way so that you can compare and contrast your results between the two. You should also follow standard machine learning practice by splitting each dataset into training and testing sets, and only touching the testing set at the very end when you are ready to report results. (Cross validation is highly recommended.)
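As a rough illustration of the split-then-cross-validate practice described above, here is a minimal sketch using scikit-learn. The dataset (`load_breast_cancer`), the 80/20 split, the 5 folds, and the choice of classifier are all placeholder assumptions, not requirements of the assignment.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Placeholder dataset (assumption); substitute one of your own datasets.
X, y = load_breast_cancer(return_X_y=True)

# Hold out the test set once, keeping class proportions similar in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# All model selection uses cross validation on the training portion only.
model = RandomForestClassifier(n_estimators=50, random_state=0)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)
print("mean CV accuracy:", cv_scores.mean())

# The test set is touched once, at the very end, to report a final number.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)
print("test accuracy:", test_accuracy)
```

The point of the sketch is the ordering: the test split is created first and then left alone until every tuning decision has been made.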

 

II. Coding (10%)

 

After choosing your datasets, you will write code to apply the machine learning algorithms you have learned about. Your code must be written in Python, but you may use any libraries that have already implemented the machine learning algorithms (e.g., scikit-learn). You are not expected to code the algorithms from scratch, and in fact I would highly discourage it. What you may not do is copy code from the internet. Below are the analyses you are required to run.

 

1) Run K-means and Hierarchical Clustering on your datasets and analyze what you observe.

 

2) Run two dimensionality reduction algorithms (PCA and t-SNE) on your datasets. Observe and analyze the results.

 

3) Re-run the K-means and Hierarchical Clustering on your dimensionality reduced datasets and compare the results to part (1).

 

4) Tune and train two ensemble models (AdaBoost and Random Forests) on both your original and dimensionality reduced datasets. Compare and analyze the results.
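To make steps 1 to 3 concrete, the following is a small sketch, not a reference solution. It assumes scikit-learn, uses `load_wine` as a stand-in dataset, fixes the number of clusters to the number of classes, reduces to 2 components, and scores clusterings with the adjusted Rand index against the class labels; each of those choices is an assumption you would revisit for your own data. The helper name `cluster_and_score` is hypothetical.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Stand-in dataset (assumption); replace with your own data.
X, y = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)  # distance-based methods need comparable scales
n_clusters = 3  # assumption: one cluster per class


def cluster_and_score(features, labels, k):
    """Cluster with K-means and hierarchical clustering, then compare each
    assignment to the class labels using the adjusted Rand index."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
    hc = AgglomerativeClustering(n_clusters=k, linkage="ward").fit_predict(features)
    return {
        "kmeans": adjusted_rand_score(labels, km),
        "hierarchical": adjusted_rand_score(labels, hc),
    }


# Step 2: two dimensionality reductions of the same scaled features.
X_pca = PCA(n_components=2, random_state=0).fit_transform(X_scaled)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_scaled)

# Steps 1 and 3: cluster the original and the reduced representations.
results = {
    "original": cluster_and_score(X_scaled, y, n_clusters),
    "pca": cluster_and_score(X_pca, y, n_clusters),
    "tsne": cluster_and_score(X_tsne, y, n_clusters),
}
for name, scores in results.items():
    print(name, scores)
```

The labels are used here only to evaluate the clusters, never to fit them. Note that scikit-learn's `TSNE` has no `transform` method for unseen data, which matters if you want to feed t-SNE features to the supervised models in step 4.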

 

Your code does not have to be pretty or well written. However, it must be written in Python, and I must be able to run one script (main.py) that produces all the results and figures in your report.
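One possible shape for main.py is sketched below, using the ensemble tuning from step 4 as the example workload. This is only an outline under assumptions: scikit-learn is the library, `load_wine` stands in for your data, the parameter grids and the 5 PCA components are arbitrary, the helper `tune` is hypothetical, and figure generation is left out.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


def tune(model, param_grid, X_train, y_train, reduce_dims):
    """Grid-search a model with 5-fold CV, optionally behind a PCA step.
    Putting PCA inside the pipeline means it is refit on each training fold."""
    steps = [("scale", StandardScaler())]
    if reduce_dims:
        steps.append(("pca", PCA(n_components=5)))  # 5 components is an assumption
    steps.append(("clf", model))
    search = GridSearchCV(Pipeline(steps), param_grid, cv=5)
    search.fit(X_train, y_train)
    return search


def main():
    X, y = load_wine(return_X_y=True)  # stand-in dataset (assumption)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=0
    )
    # Small illustrative grids; real tuning would explore more values.
    candidates = {
        "adaboost": (
            AdaBoostClassifier(random_state=0),
            {"clf__n_estimators": [25, 50], "clf__learning_rate": [0.5, 1.0]},
        ),
        "random_forest": (
            RandomForestClassifier(random_state=0),
            {"clf__n_estimators": [50, 100], "clf__max_depth": [None, 5]},
        ),
    }
    results = {}
    for name, (model, grid) in candidates.items():
        for reduce_dims in (False, True):
            search = tune(model, grid, X_train, y_train, reduce_dims)
            results[(name, reduce_dims)] = {
                "best_params": search.best_params_,
                "cv_accuracy": search.best_score_,
                # The test set is scored once, after tuning has finished.
                "test_accuracy": search.score(X_test, y_test),
            }
    return results


if __name__ == "__main__":
    for key, value in main().items():
        print(key, value)
```

Keeping every experiment behind a single `main()` makes it straightforward to run the whole analysis with one command.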

 

III. Report (80%)

 

You will then produce a report describing and analyzing your methods and results. Here you will describe the datasets you have chosen and why they are interesting. You will then provide an analysis on how the different machine learning algorithms performed on each dataset. The report must be limited to 10 pages maximum. Plots and figures are highly recommended. It is up to you how you wish to demonstrate your understanding of the machine learning algorithms you have explored, but below I have listed some potential ideas for analysis and items you may wish to include in the report.

 

• A description of your two datasets and why you feel that they are interesting.
• Hypotheses on how you believe the learning algorithms will perform on each dataset and why.
• How you dealt with different features in your datasets? missing data? different scalings?
• Training and testing error rates you obtained for your various learning algorithms (some sort of cross validation is highly recommended)
• The effect of hyperparameters on performance
• Comparing and contrasting results between datasets
• Comparing and contrasting results between learning algorithms
• Training and testing error rates as a function of training dataset size
• Timing analysis of how long it takes to train/test each algorithm
• Conclusions
• Ideas for future analyses
• What you may have done differently
• References
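Two of the ideas above, error rates as a function of training set size and timing analysis, can be gathered in one pass. The sketch below assumes scikit-learn's `learning_curve` with `return_times=True`; the dataset, the classifier, the five training sizes, and the 5 folds are placeholder assumptions.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import learning_curve, train_test_split

# Placeholder dataset (assumption); the held-out test split is not used here.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = AdaBoostClassifier(n_estimators=50, random_state=0)

# Cross-validated scores and timings at five increasing training set sizes.
train_sizes, train_scores, val_scores, fit_times, score_times = learning_curve(
    model,
    X_train,
    y_train,
    train_sizes=np.linspace(0.2, 1.0, 5),
    cv=5,
    shuffle=True,
    random_state=0,
    return_times=True,
)

# Convert accuracy to error rate and average over the folds.
train_err = 1.0 - train_scores.mean(axis=1)
val_err = 1.0 - val_scores.mean(axis=1)
mean_fit_seconds = fit_times.mean(axis=1)

for n, tr, va, secs in zip(train_sizes, train_err, val_err, mean_fit_seconds):
    print(f"n={n:4d}  train error={tr:.3f}  validation error={va:.3f}  fit time={secs:.3f}s")
```

Plotting `train_err` and `val_err` against `train_sizes` gives the usual learning curve; a persistent gap between the two is one thing worth explaining in the report.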

 

You are NOT being graded on how well the algorithms perform on your datasets. What is most important is WHY. You should be explaining and justifying all of your figures and results, and demonstrating that you understand the intricate details of the machine learning process and the machine learning algorithms you are using.

 

IV. Presentation (10%)

 

Finally, you will give a presentation of your results lasting at most 7 minutes (you will be cut off exactly at the 7-minute mark). In this presentation you will describe your datasets, your methods, and any interesting results you found!

 

What to turn in?

 

Below is a list of items you will be required to turn in via Canvas. Please make sure all documents are named as described below.

 

• report.pdf – Your report (10 pages maximum) in PDF format. Do not use a super tiny or large font. No specific formatting is required, but use common sense.

 

• presentation.pptx or presentation.key – Your presentation slides as either a PowerPoint or a Keynote document.

 

• code.zip – A zip file with all of the code you have written. Within the folder there should be a file called README.txt that contains instructions on how to run your code, and a Python file called main.py that will produce all figures and plots in your report/presentation. I should be able to reproduce your results easily.

 

• data.zip – A zip file that contains the two datasets you have chosen.

 

Grading

 

You are being scored on your analysis more than anything else. Roughly speaking, implementing everything and getting it to run is worth very little for this assignment. Of course, analysis without proof of working code makes the analysis suspect. The key thing is that your explanations should be both thorough and concise, and your analysis should prove to me you have a deep understanding of the machine learning process and the machine learning algorithms you are using.