The practical part of this course is grounded in Kaggle competitions organized specifically for the class.
The first competition aims at classifying emotions on faces:
https://inclass.kaggle.com/c/facial-keypoints-detector
The second competition aims at identifying keypoints on the face images:
https://inclass.kaggle.com/c/facial-keypoint-detection
The duty of each student is to create an experiments journal (could be a WordPress blog) documenting their weekly effort towards implementing and experimenting with representation-learning algorithms on the provided datasets. A large part of the grade assigned for the practical part of the course will depend on that journal. There are three criteria by which these journals will be evaluated:
- Effort and completeness: the journal documents your efforts; its completeness and the regularity of entries are important.
- Understandability: other graduate students of the same class should be able to understand what you did, the main ideas, their motivations, and the main conclusions of your observations. If you were to write a paper about these experiments, this would be the main criterion of evaluation. There should be executive summaries of the main results as you progress along the way (the material that would go in the abstract or conclusion of a scientific paper).
- Reproducibility: other graduate students of the same class should be able to reproduce your experiments and get essentially the same results without having to do too much work. That means your code and scripts should be publicly available in some web-based repository, and that the description of experiments (e.g., settings of hyper-parameters) should be clear enough and clearly linked with the results obtained.
Did anyone successfully train on the keypoint data with unsupervised pretraining?
I tried several structures: RBM-RBM-RBM-Softmax, DAE-DAE-DAE-Softmax,
RBM-DAE-DAE-Softmax.
None of them gives reasonable results. I don’t know whether it’s because my configuration is not optimal or because of some other problem. If anyone has successfully trained on the data with unsupervised pretraining, can you give me the link to your repo? Thanks!
Yoshua and I are organizing a workshop at the International Conference on Machine Learning, along with UdeM alumnus Dumitru Erhan. As part of the workshop, we’re hosting three different Kaggle competitions:
http://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/
http://www.kaggle.com/c/challenges-in-representation-learning-multi-modal-learning
http://www.kaggle.com/c/challenges-in-representation-learning-the-black-box-learning-challenge/leaderboard
First prize for each of them is $350 + a speech at the workshop.
This is not part of the class at all, but your experience in this class has prepared you to do well in any of these contests if you are interested. The facial expression recognition challenge in particular is very similar to the challenge you completed for the first half of this course, but using a different dataset.
I found something else worth improving in the dataset script. The data points in the train.csv are organised by source dataset (the train.csv has been assembled from data coming from 4 distinct datasets). This means that the first n1 lines in the csv come from the first source dataset, the next n2 lines come from the second source dataset, etc. On top of that, the dataset script does not shuffle the dataset before splitting it into training and validation.
Together, these 2 facts mean that when you split the data into a training and a validation set, your validation set only contains data from the last source dataset. This makes the validation error a very bad proxy for the public_test error, because the public_test dataset actually contains data from all 4 source datasets.
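The fix is conceptually simple: shuffle with a fixed seed before splitting, so both splits draw from all four source datasets. A minimal sketch in plain numpy (the function name and split sizes here are just for illustration, not the actual fix that went into the dataset class):

```python
import numpy as np

def shuffle_and_split(X, y, n_train, seed=42):
    """Shuffle rows with a fixed seed, then split into train/valid.

    Because the rows of train.csv are grouped by source dataset,
    splitting without shuffling puts only the last source in the
    validation set; shuffling first mixes all four sources into
    both splits, while the fixed seed keeps the split reproducible.
    """
    rng = np.random.RandomState(seed)
    perm = rng.permutation(X.shape[0])
    X, y = X[perm], y[perm]
    return (X[:n_train], y[:n_train]), (X[n_train:], y[n_train:])
```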
I already mentioned this to Vincent Archambault and he is currently working on a fix for this.
I’ll add a Preprocessor class to shuffle and split any DenseDesignMatrix. That way we don’t need to solve this problem separately every time we make a new Dataset.
Use the ShuffleAndSplit preprocessor that I just added: https://github.com/lisa-lab/pylearn2/blob/master/pylearn2/datasets/preprocessing.py
Sorry, this doesn’t actually work for this class, because the submission script will try to split the test set. If someone wants to come up with a better interface for handling train and test sets I’m open to suggestions.
In my local copy, I commented out the line:
assert stop = 1783 (the length of the test set), and it all works fine. It is a bit hacky, but it works in this case, though I agree a more long-term solution should be considered. I’ll try to think about it.
Damn you, WordPress and your HTML parsing!
The line I commented out was : assert stop <= X.shape[0]
Vincent fixed the bug in the dataset class for the contest. I’ve merged his pull request into the class copy of the dataset, and copied over his numpy files, so that should work fine for everyone now. Let me + Vincent know if you still have trouble.
I modified the script for loading the dataset. Now it uses a .npy file. This should solve the memory problem. It also loads much faster.
If you use the Lisa lab computers you don’t need to change anything. If you work from home, update your ContestDataset repository.
Hey, it seems your script has a bug: when the .yaml file defines a training and a validation set, both taken from which_set=’train’, the dataset script returns the same data for both the training and the validation set. When I use this new version of your script with your baseline .yaml, my training and validation errors are identical at every epoch.
From what I see, this is what happens to cause the bug:
1- The script is asked for a subset of the train set as training data.
2- The script looks on the file system, sees that there is no numpy file for the training set, so it parses train.csv, takes a subset of it, saves that subset on the file system, and returns it to the caller.
3- The script is asked for a different subset of the train set as validation data.
4- The script looks on the file system, sees that there is now a numpy file for the training set (the subset of the train set used as training data), so it loads that file and returns it to the caller.
Perhaps a better approach would be to simply parse the whole train.csv file and save it as a numpy file in the same folder as the dataset script. Then, when the dataset script is called, it simply loads the numpy file and returns a subset of it (defined by start and stop) to the caller.
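That approach could look roughly like this (a sketch with hypothetical file names, assuming a purely numeric CSV with one header row — the real train.csv needs the existing parsing code, and the actual fix lives in the dataset class):

```python
import os
import numpy as np

def load_subset(csv_path, npy_path, start, stop):
    """Parse the CSV once, cache the full array as .npy,
    then serve any [start:stop) slice from the cache.

    Caching the *whole* file (rather than the requested subset)
    avoids the bug where a cached training subset is returned
    when the validation subset is requested.
    """
    if os.path.exists(npy_path):
        data = np.load(npy_path)
    else:
        data = np.loadtxt(csv_path, delimiter=',', skiprows=1)
        np.save(npy_path, data)
    return data[start:stop]
```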
I actually wrote a dataset class that does what Pierre-Luc is suggesting for a different Kaggle contest. Vincent, you can probably copy-paste this and get what you want with minimal tweaks:
https://github.com/lisa-lab/pylearn2/blob/master/pylearn2/scripts/icml_2013_wrepl/emotions/emotions_dataset.py
I’m on it. It will be done shortly.
Bug fix done. Thank you, Pierre-Luc, for spotting this one.
For the keypoints competition, you can find details on my blog (http://archambaultv.com/?p=201) about a few goodies that will help you get started:
– A dataset class for Pylearn2
– A submission script just like the one Ian did for the first contest
– A script that will superpose the keypoints computed by your model on the images, so you can visualize your results.
Also update your Pylearn2 repository, because there is now a new cost function (MissingTargetCost) that can handle missing keypoints. Yes, some images don’t have all the keypoints provided.
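The idea behind a cost that tolerates missing targets can be sketched in plain numpy. Note this is an illustration of the masking idea only, not the actual MissingTargetCost implementation, and the sentinel value for missing keypoints is an assumption:

```python
import numpy as np

def masked_mse(predictions, targets, missing=-1.0):
    """Mean squared error that ignores missing keypoints.

    Assumes missing targets are flagged with a sentinel value
    (here -1; the actual encoding in the contest data may differ).
    Only the observed coordinates contribute to the cost, so
    images with partially labelled keypoints can still be used.
    """
    mask = (targets != missing).astype(predictions.dtype)
    sq_err = mask * (predictions - targets) ** 2
    # average over the observed entries only
    return sq_err.sum() / mask.sum()
```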
Don’t hesitate to contact me if you need help or find any bug.
Thanks for providing these resources!
When I tried training a model using the yaml file you provided, I was not able to load the dataset from /data/lisatmp/ift6266h13/ContestDataset. The error I got was:
AttributeError: Could not evaluate keypoints_dataset.FacialKeypointDataset. Did you mean __package__? Original error was ‘module’ object has no attribute ‘FacialKeypointDataset’
I’m thinking this might be caused by permissions, because when I try listing the directory /data/lisatmp/ift6266h13/ContestDataset/keypoints_dataset, I get a “Permission denied” error.
Any ideas on how I can resolve this?
This is fixed now. I’d given group read permissions to the files, but hadn’t noticed the group they belonged to was Yoshua’s research group. I chowned them to belong to the class group. Sorry about that.
When I presented maxout in class, Pierre-Luc asked if I’d ever tried training the universal approximator version of it. I said no, because while a network with 2 units and many pieces per unit can approximate any function, it might be hard to train such a network or to have it generalize well. I tried training it today just to see what would happen. I used 2 units and 600 pieces per unit. It got stuck at about 35% validation error on MNIST.
For those of you who use the Transformer module, I suggest that you drop the noisy transformations GaussianNoise, Sharpening and Denoizing. Try your best models again; you might get better results. I will post my results on the best transformations before the weekend.
If anyone is interested, I have added a BC01 implementation of Alex Krizhevsky’s Local Response Normalization technique. You can see the details on my blog if you are interested : http://plcift6266.wordpress.com/2013/03/10/bc01-local-response-normalization/
I merged my transformation script with the work of Pierre-Luc. It now contains 9 transformation functions: Translation, Scaling, Rotation, Horizontal flipping, Occlusion, Half face, Gaussian Noise, Sharpening and Denoizing.
Take a look at my blog for more information (http://bouthilx.wordpress.com/2013/03/04/blow-up-the-training-set-but-not-my-ram/)
The /data/lisatmp and /data/lisa filesystems are offline today until Monday. The admins need to run fsck on the filesystems to look for data corruption caused by last week’s power outage. (We know that there was some for sure, but hopefully the RAID rebuild fixed it) This means you won’t be able to work on the contest using the lisa machines right now.
Hi all,
I wrote some code to perform on-the-fly transformations on training data. The transformations implemented so far are translation, rotation and horizontal reflection. I have a small post explaining it on my blog (http://plcift6266.wordpress.com/) and the code is available on my github repository (https://github.com/carriepl/ift6266h13/tree/master/code/transformations).
Feel free to use it if you think it might be useful to you. Simply look at the example.yaml file to see how to do it, and ensure that your PYTHONPATH environment variable includes the folder in which you put the 2 python scripts that perform the transformations.
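For a rough idea of what such on-the-fly transformations look like, here is a sketch using scipy.ndimage. The parameter ranges are illustrative only; the actual implementation is in the repository above:

```python
import numpy as np
from scipy.ndimage import rotate, shift

def random_transform(image, rng):
    """Apply one random transformation: translation, rotation,
    or horizontal reflection.

    `image` is a 2-D array (e.g. a 48x48 face); `rng` is a numpy
    RandomState, so the augmentation stream is reproducible.
    """
    choice = rng.randint(3)
    if choice == 0:                      # translation by a few pixels
        dy, dx = rng.randint(-3, 4, size=2)
        return shift(image, (dy, dx), mode='nearest')
    elif choice == 1:                    # small rotation
        angle = rng.uniform(-15, 15)
        return rotate(image, angle, reshape=False, mode='nearest')
    else:                                # horizontal reflection
        return image[:, ::-1]
```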
I have updated the ContestDataset module, so if you have your own copy of it be sure to do an update. The latest change makes it possible to emit the data with different axes so you can use either theano or cuda_convnet for convolution. It also fixes a bug in make_submission.py where the submission script gave matrix inputs to models requesting 4-tensor inputs.
Hi I’m going to report my progress in that blog page: http://caglarift6266.wordpress.com/ you can ignore the previous one.
When we apply random transformations to images, should we seed the random function such that we always get the same transformations given the same setting?
Yes, using a known seed is always a good idea. It doesn’t hurt anything (because you can always run the same script with several different seeds) and having a deterministic script makes it easier to reproduce your results, and to debug several kinds of problems.
I created a python script to enable easy comparison of the performance of a bunch of models. I find it particularly useful when launching a bunch of jobs on the cluster. There’s a post on my blog describing its usage, and the code is on my github repository.
My blog :
http://plcift6266.wordpress.com/
My code repository :
https://github.com/carriepl/ift6266h13
I added the possibility to sort by multiple attributes at the same time. I find it pretty useful.
Thank you for your script!
This looks like a good script. Would one of you like to make a pull request to the main pylearn2 repository to add it to pylearn2/scripts?
Fred has set up databases for people that want to use jobman. (jobman should not be necessary since you are only using one computer cluster, but it can be helpful for analyzing hyperparameter search experiments)
I can’t post the login information here, because this site is viewable by the public and we only want class students to be able to login. But I have created a file on the LISA filesystem, /data/lisatmp/ift6266h13/database.txt that gives the login information.
I wrote a simple script to generate yaml files given a template file and a hyper-parameter configuration file.
bouthilx.wordpress.com/2013/02/21/yaml-file-generator/
https://github.com/bouthilx/ift6266kaggle/tree/master/gen_yaml
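The general idea of such a generator can be sketched with Python’s string.Template. The template contents and placeholder names here are hypothetical; see the repository for the actual script:

```python
from string import Template

# Hypothetical template with $-placeholders for hyper-parameters.
template = Template(
    "!obj:pylearn2.training_algorithms.sgd.SGD {\n"
    "    learning_rate: $learning_rate,\n"
    "    batch_size: $batch_size,\n"
    "}\n")

def generate_yamls(configs):
    """Render one yaml string per hyper-parameter configuration."""
    return [template.substitute(cfg) for cfg in configs]
```

Each dictionary of hyper-parameters then yields one ready-to-train yaml file, which makes grid searches just a loop over configurations.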
Repo : https://github.com/mttdg05/ift6266
Blog : http://ift6266diagnema.wordpress.com/
A further comment about preprocessing:
Pierre-Luc asked me why I made the default to be fit_preprocessor=False, since this seems like it disables learning-based preprocessors.
The answer is that most of the time you run a yaml file, you are training on 80% of the training data, and validating on the remaining 20%. If the preprocessor fits itself in this context, you’ll end up with a preprocessor fit on 80% of the training data. What you probably want to have is a preprocessor fit on all of the training and all of the test data. To do this, you should have a separate script train the preprocessor and save it in a .pkl file. Then use
preprocessor: !pkl: “my_preprocessor.pkl”
to load it in your train experiment. The fit_preprocessor=False default makes sure that you don’t accidentally re-train the preprocessor and blow away everything it learned on the rest of the data.
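A minimal sketch of that separate fitting script, using a hand-rolled standardizer as a stand-in for a real pylearn2 preprocessor (the class, file name, and data here are placeholders):

```python
import pickle
import numpy as np

class Standardize(object):
    """Stand-in for a learned preprocessor: remembers per-feature
    mean and std fit on one dataset, applies them to another."""
    def fit(self, X):
        self.mean = X.mean(axis=0)
        self.std = X.std(axis=0) + 1e-8
        return self
    def apply(self, X):
        return (X - self.mean) / self.std

# Fit once on *all* the data the statistics should reflect
# (e.g. the full train + test design matrices), then save:
X_all = np.random.randn(100, 5)   # placeholder for the real data
prep = Standardize().fit(X_all)
with open('my_preprocessor.pkl', 'wb') as f:
    pickle.dump(prep, f)
```

The training yaml then loads the saved .pkl with the !pkl: syntax, and fit_preprocessor stays False so the learned statistics are never overwritten.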
If you are a Polytechnique student you have an IRO login, but you may not know it yet. If you don’t know your login you will need to see Bernard in Pavillon Andre-Aisenstadt 3221.
I’ve updated the ContestDataset repository to support two new arguments, “fit_preprocessor” and “fit_test_preprocessor”. If fit_preprocessor is true, the preprocessor can be fit while making that dataset. If fit_test_preprocessor is true, then when you call get_test_set on the first dataset, the preprocessor will be refit on the test set.
I found that re-fitting on the test set worked better than keeping the preprocessing fixed between train and test, at least when using pylearn2.datasets.preprocessing.Standardize with my MLP setup.
I get an error rate of 0.19 on my validation set, but my test score is 0.169. So I looked into the submission.csv and found that all the examples are classified as 2 or 5.
A possible reason is that the test data loaded in make_submission.py is not normalized. But how can I tell make_submission.py to load the data from a pkl file?
here is my make_dataset.py https://github.com/cccrystalyy/ift6266/blob/master/make_dataset.py
If your preprocessing doesn’t involve any learning (i.e., if you just divide each example by its norm or something like that) then if you specify the “preprocessor” argument of ContestDataset in your yaml file for the training phase, everything should just work.
make_submission.py loads the model, looks at model.dataset_yaml_src, and re-parses the yaml to obtain the dataset the model was trained on. It then calls dataset.get_test_set() to get the test set. ContestDataset.get_test_set will pass the same preprocessor argument to the test set constructor as was originally passed to the train set constructor.
If your preprocessing does involve learning (i.e., if you compute the variance of a feature on the training set) then I admit the current interface doesn’t support that yet. Probably the easiest way to add that would be to edit the submission script to take a second argument specifying a yaml file to load. Change it to check the length of sys.argv and if there is a 3rd argument, use pylearn2.config.yaml_parse.load_path to get the dataset from that argument instead of calling pylearn2.config.yaml_parse.load on model.dataset_yaml_src as it does now.
It looks like I’m currently at the top of the leaderboard without having used any learning in my preprocessing. My submission used
preprocessor: !obj:pylearn2.datasets.preprocessing.GlobalContrastNormalization {}
if you’d like to try that.
Thanks, but that’s not what I’m asking… As I said above, I want to know how to tell make_submission.py to load the data from a pkl file instead of loading it from the model, because I don’t want to retrain the model (too long…).
Anyway, I will try the preprocessor.
I understood what you were asking. This was the part of my comment that explains how to load a different dataset:
If your preprocessing does involve learning (i.e., if you compute the variance of a feature on the training set) then I admit the current interface doesn’t support that yet. Probably the easiest way to add that would be to edit the submission script to take a second argument specifying a yaml file to load. Change it to check the length of sys.argv and if there is a 3rd argument, use pylearn2.config.yaml_parse.load_path to get the dataset from that argument instead of calling pylearn2.config.yaml_parse.load on model.dataset_yaml_src as it does now.
You can use the yaml file to load a .pkl file using the yaml syntax:
!pkl: “/path/to/pkl/file.pkl”
If you want you could do the edits I recommend, but call pylearn2.serial.load directly on argument 3 to make_submission.py and assume that argument will be a pkl file. That will accomplish what you’re trying to do but will be somewhat less general than making it load arbitrary yaml.
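That simpler variant could be sketched like this (a hypothetical function, not the actual make_submission.py: plain pickle stands in for pylearn2.utils.serial.load, and the yaml re-parsing is abstracted into a callable):

```python
import pickle

def get_dataset(model, argv, rebuild_from_yaml):
    """Sketch of the edit described above.

    `argv` mirrors sys.argv: [script, model_pkl, optional_dataset_pkl].
    If a third argument is present, load the dataset directly from
    that pkl file; otherwise rebuild the dataset the model was
    trained on from model.dataset_yaml_src (here `rebuild_from_yaml`
    stands in for pylearn2.config.yaml_parse.load).
    """
    if len(argv) > 2:
        with open(argv[2], 'rb') as f:
            return pickle.load(f)
    return rebuild_from_yaml(model.dataset_yaml_src)
```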
I migrated my blog to WordPress:
http://ift6266h13yangyang.wordpress.com/
Regarding the contest, I have noticed a considerable difference between the misclassification rate I obtain on my validation set and the score I get on the test set when I submit my results on Kaggle (this score being based on classification accuracy). At the moment, my best models obtain a misclassification rate around 0.35 on the validation set, whereas my best score on the test set is 0.33282 in terms of classification accuracy. I am simply wondering if anyone else has had similar results.
Yes, the test dataset is much harder than training data. The training data is based on manually aligned images of people who were paid to act out each emotion and photographed under careful conditions. The test dataset is based on Google image search results for different emotion keywords, and pictures of LISA lab members taken under a wide range of conditions with different webcams.
I actually got a validation set score of over 80% accuracy but my leaderboard score is only 36%.
Good to know. Thanks!
My blog is at : http://archambaultv.com
My repository for all code related to this course is : https://github.com/archambaultv/IFT6266
My repo and wiki:
https://github.com/nicholas-leonard/ift6266
I have this error message while using contest data:
Forcing DISTUTILS_USE_SDK=1
Building dataset from yaml…
…done
norms of examples:
min: 2626.38192196
mean: 5932.80437406
max: 11492.1476235
range of elements of examples (0.0, 255.0)
dtype: float64
(48L, 48L)
Traceback (most recent call last):
File “H:\GitHubCode\pylearn2\pylearn2\scripts\show_examples.py”, line 110, in
pv = patch_viewer.PatchViewer( (rows, cols), examples.shape[1:3], is_color = is_color)
File “H:\Python27\lib\site-packages\pylearn2-0.1dev-py2.7.egg\pylearn2\gui\patch_viewer.py”, line 62, in __init__
assert isinstance(elem,int)
AssertionError
Update pylearn2. I’ve improved the error message to give some more information. I’m guessing numpy behaves differently on Windows somehow.
New error :
Forcing DISTUTILS_USE_SDK=1
Building dataset from yaml…
…done
norms of examples:
min: 2626.38192196
mean: 5932.80437406
max: 11492.1476235
range of elements of examples (0.0, 255.0)
dtype: float64
(48L, 48L)
Traceback (most recent call last):
File “H:\GitHubCode\pylearn2\pylearn2\scripts\show_examples.py”, line 110, in
pv = patch_viewer.PatchViewer( (rows, cols), examples.shape[1:3], is_color = is_color)
File “h:\githubcode\pylearn2\pylearn2\gui\patch_viewer.py”, line 63, in __init__
raise ValueError(“Expected grid_shape and patch_shape to be pairs of ints, but they are %s and %s respectively.” % (
str(grid_shape), str(patch_shape)))
ValueError: Expected grid_shape and patch_shape to be pairs of ints, but they are (20, 20) and (48L, 48L) respectively.
OK, it looks like on your installation, numpy sometimes specifies shapes as python longs, which I’ve never even heard of before. I changed the check to allow longs, so update pylearn2 again.
Running any of the three pylearn2 tutorials mentioned on this page (https://github.com/lisa-lab/pylearn2/tree/master/tutorials), including multilayer_perceptron.ipynb, I get the following message:
“An error occurred while loading this notebook. Most likely this notebook is in a newer format than is supported by this version of IPython. This version can load notebook formats v3 or earlier.”
Ian and David checked it out but couldn’t figure out the problem. Fred has been notified of the issue. Hopefully, he knows how to resolve it.
My repo: https://github.com/gbcolborne/ift6266
My blog: http://ift6266gbc.wordpress.com/
The way the contest setup works, the public test dataset does not include any labels. You can’t use this as a validation set during training to do things like early stopping. To do that you must make your own validation set from a subset of the training set.
I’ve updated the ContestDataset repository to include “start” and “stop” parameters that you can use to load a subset of the dataset. You can use this to make your own validation set.
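For instance, a train yaml might carve the labelled data into training and validation subsets like this (the class path and example counts are illustrative, not the exact values for the contest):

```yaml
dataset: !obj:contest_dataset.ContestDataset {
    which_set: 'train',
    start: 0,
    stop: 3200        # hypothetical: first ~80% for training
},
monitoring_dataset: {
    'valid': !obj:contest_dataset.ContestDataset {
        which_set: 'train',
        start: 3200,  # hypothetical: remaining ~20% for validation
        stop: 4000
    }
}
```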
To see how well you’re doing on the contest, you can run the new make_submission.py script on one of your trained models. This will make a .csv file containing your model’s estimate of the correct labels on the public test set. You can then upload this .csv file to the kaggle site to find out how well you scored.
When using the LISA computers, you should add the following to your ~/.bashrc file:
if [ -e "/opt/lisa/os/.local.bashrc" ]; then
    source /opt/lisa/os/.local.bashrc
elif [ -e /data/lisa/data/local_export/.local.bashrc ]; then
    source /data/lisa/data/local_export/.local.bashrc
fi
If your ipython is out of date, Fred recommends that you upgrade to 0.13 rather than trying to keep using the old version. ipython notebooks have been enhanced a lot in the latest version.
You can get version 0.13 by running:
pip install -U ipython
If you don’t have pip yet, you can install it with apt-get.
My Repo:
https://github.com/SinaHonari/ift6266.git
My Blog:
https://ift6266sina.wordpress.com/
To use a computer at LISA, Fred says to use maggie46. You can log in remotely via elisa1@iro.umontreal.ca then ssh further to maggie46.
Hi,
My blog :
http://plcift6266.wordpress.com/
My code repository :
https://github.com/carriepl/ift6266h13
Blog:
https://bouthilx.wordpress.com/
Repository:
https://github.com/bouthilx/ift6266kaggle
My research blog is the following: http://geoffroyift6266.wordpress.com/
And I will put my files here: https://github.com/geoffroymouret/ift6266
I’ve created a python module for accessing the Kaggle dataset. On a DIRO computer, just add
/data/lisatmp/ift6266h13/ContestDataset/
to your PYTHONPATH environment variable. This will allow you to import the contest_dataset module.
On the DIRO machines, there is no need for you to download the Kaggle dataset; you can just use the files I downloaded.
If you want to use a different setup, such as your own machine, you can get the python module by cloning my github repository:
https://github.com/goodfeli/ContestDataset
Different versions of the ipython notebook are not compatible with each other, which could cause some problems.
If you’re using an outdated version of ipython, you won’t be able to read ipython notebooks made using the latest version. But people with the latest version of ipython will be able to read yours.
I use the newest version of the EPU distribution, which has ipython 0.12.1 and python 2.7.3. Which version is used by UdeM?
What is EPU? Do you mean EPD? The latest version of EPD has ipython 0.13. That is what I used on the DIRO machines to generate the notebooks for the class.
*EPD. OK, I will check for an update.
Each student should make a research journal that can be shared with the class.
You can do this in one of two ways: either with an ipython notebook or a wordpress blog. Both support using images and latex equations. ipython also supports executable python code. Feel free to choose either method.
WordPress instructions:
1. Go to http://wordpress.com/#!/my-blogs/
2. Click “Create a New Blog” and follow the instructions
3. Post the link to your blog here
4. Read http://en.support.wordpress.com/latex/ to see how to post latex on your blog.
ipython notebook instructions:
1. Create an account on github.com
2. Click “Create a new repository”
3. Follow the instructions. Check “Initialize this repository with a README”
4. In a bash shell, use “git clone” to check out your new git repository.
5. From inside the repository, use “ipython notebook” to open the ipython notebook editor. Save a notebook to the repository.
6. Use git to push the notebook back to your github account.
7. Post a link to your github repository here so we can follow your updates.
8. Check out some example notebooks to get an idea of the syntax: http://nbviewer.ipython.org/
Hi,
The blog where I’m going to take my notes is this:
http://caglar.codedanger.com/
and the github repository where I’m going to put my code is here:
https://github.com/caglar/ift6266-project
My Repo is https://github.com/cccrystalyy/ift6266
My notes will be here: http://silentlistener.me/blog/