The practical part of this course is grounded in Kaggle competitions organized specifically for the class.

 

The first competition aims at classifying emotions on faces:

https://inclass.kaggle.com/c/facial-keypoints-detector

 

The second competition aims at identifying keypoints on the face images:

https://inclass.kaggle.com/c/facial-keypoint-detection

 

The duty of each student is to create an experiments journal (could be a WordPress blog) documenting their weekly effort towards implementing and experimenting with representation-learning algorithms on the provided datasets. A large part of the grade assigned for the practical part of the course will depend on that journal. There are three criteria by which these journals will be evaluated:

  • Effort and completeness: the journal documents your efforts; the completeness and regularity of its entries are important.
  • Understandability: other graduate students of the same class should be able to understand what you did, the main ideas, their motivations, and the main conclusions of your observations. If you were to write a paper about these experiments, this would be the main criterion of evaluation. There should be executive summaries of the main results as you progress along the way (the material that would go in the abstracts or conclusions of scientific papers).
  • Reproducibility: other graduate students of the same class should be able to reproduce your experiments and get essentially the same results, without having to do too much work. That means your code and scripts should be publicly available in some web-based repository, and that the description of each experiment (e.g., the settings of hyper-parameters) is clear enough and clearly linked with the results obtained.

74 Responses to “Explore”


  1. 1 zhaoyangyang April 30, 2013 at 18:21

    Did anyone successfully train on the keypoint data with unsupervised pretraining?

    I tried several structures: RBM-RBM-RBM-Softmax, DAE-DAE-DAE-Softmax, and RBM-DAE-DAE-Softmax.

    None of them gives reasonable results. I don't know whether it's because my configuration is not optimal or because of some other problem. If anyone successfully trained on this data with unsupervised pretraining, can you give me the link to your repo? Thanks!

  2. 2 Ian Goodfellow April 13, 2013 at 20:18

    Yoshua and I are organizing a workshop at the International Conference on Machine Learning, along with UdeM alumnus Dumitru Erhan. As part of the workshop, we’re hosting three different Kaggle competitions:

    http://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/

    http://www.kaggle.com/c/challenges-in-representation-learning-multi-modal-learning

    http://www.kaggle.com/c/challenges-in-representation-learning-the-black-box-learning-challenge/leaderboard

    First prize for each of them is $350 plus a talk at the workshop.

    This is not part of the class at all, but your experience in this class has prepared you to do well in any of these contests if you are interested. The facial expression recognition challenge in particular is very similar to the challenge you completed for the first half of this course, but using a different dataset.

  3. 3 Pierre Luc Carrier April 11, 2013 at 11:20

    I found something else worth improving in the dataset script. The data points in the train.csv are organised by source dataset (the train.csv has been assembled from data coming from 4 distinct datasets). This means that the first n1 lines in the csv come from the first source dataset, the next n2 lines come from the second source dataset, etc. On top of that, the dataset script does not shuffle the dataset before splitting it into training and validation.

    Together, these 2 facts mean that when you split that data into a training and a validation set, your validation set only contains data from the last source dataset. This makes the validation error a very bad proxy for the public_test error, because the public_test dataset actually contains data from all 4 source datasets.

    I already mentioned this to Vincent Archambault and he is currently working on a fix for this.
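
    In the meantime, a minimal workaround sketch is to shuffle the full training set with a fixed seed before splitting it yourself. The toy arrays below stand in for whatever you parse out of train.csv (the shapes and the 80/20 split are illustrative, not the ContestDataset defaults):

    import numpy as np

    # Toy stand-ins for the arrays parsed from train.csv (hypothetical shapes).
    X = np.random.rand(7000, 96 * 96)
    y = np.random.rand(7000, 30)

    rng = np.random.RandomState(42)      # fixed seed so the split is reproducible
    perm = rng.permutation(X.shape[0])   # random order over all examples
    X, y = X[perm], y[perm]              # shuffle inputs and targets together

    n_train = int(0.8 * X.shape[0])      # 80/20 split after shuffling
    X_train, y_train = X[:n_train], y[:n_train]
    X_valid, y_valid = X[n_train:], y[n_train:]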

  4. 9 Ian Goodfellow April 8, 2013 at 09:58

    Vincent fixed the bug in the dataset class for the contest. I’ve merged his pull request into the class copy of the dataset, and copied over his numpy files, so that should work fine for everyone now. Let me + Vincent know if you still have trouble.

  5. 10 Vincent Archambault-B April 5, 2013 at 15:14

    I modified the script for loading the dataset. Now it uses a .npy file. This should solve the memory problem. It also loads much faster.

    If you use the Lisa lab computers you don’t need to change anything. If you work from home, update your ContestDataset repository.

    • 11 Pierre Luc Carrier April 7, 2013 at 13:42

      Hey, it seems your script has a bug; when the .yaml file defines a training and a validation set, both taken from which_set=’train’, the dataset script returns the same data for both the training and validation set. When I use this new version of your script with your baseline .yaml, my training and validation errors are identical at every epoch.

      From what I see, this is what happens and causes the bug:
      1-The script is asked for a subset of the train set as training data.
      2-The script looks on the file system, sees that there is no numpy file for the training set, so it parses train.csv, takes a subset of it, saves that subset on the file system, and returns it to the caller.
      3-The script is asked for another subset of the train set as validation data.
      4-The script looks on the file system, sees that there is a numpy file for the training set (the subset of the train set used as training data), so it loads it and returns it to the caller.

      Perhaps a better approach would be to simply parse the whole train.csv file and save it as a numpy file in the same folder as the dataset script. Then, when the dataset script is called, it simply loads the numpy file and returns a subset of it (defined by start and stop) to the caller.
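
      A rough sketch of that approach (the file names and the CSV layout assumed here are only illustrative, not the actual ContestDataset code):

      import os
      import numpy as np

      CACHE = "train_full.npy"  # hypothetical cache file next to the dataset script

      def load_train(start, stop, csv_path="train.csv"):
          """Parse train.csv once, cache it, then serve any (start, stop) slice."""
          if not os.path.exists(CACHE):
              # Assumed layout: one header line, then purely numeric comma-separated rows.
              data = np.loadtxt(csv_path, delimiter=",", skiprows=1)
              np.save(CACHE, data)
          return np.load(CACHE)[start:stop]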

  6. 15 Vincent Archambault-B March 28, 2013 at 22:19

    For the key points competition, you can find details on my blog (http://archambaultv.com/?p=201) about a few goodies that will help you get started:
    – A dataset class for Pylearn 2
    – A submission script just like the one Ian did for the first contest
    – A script that will superpose the keypoints computed by your model on the images, so you can visualize your results.

    Also update your PyLearn2 repository because there is now a new cost function (MissingTargetCost) that can handle missing key points. Yes, some images don't have all the key points provided.
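
    The idea behind handling missing key points can be sketched in plain numpy as a masked mean squared error. This is only an illustration of the principle, not the MissingTargetCost code, and encoding missing targets as NaN is an assumption:

    import numpy as np

    def masked_mse(predictions, targets):
        """MSE that ignores missing key points (encoded here as NaN in targets)."""
        mask = ~np.isnan(targets)                # True where a key point is provided
        diff = np.where(mask, predictions - targets, 0.0)
        return np.sum(diff ** 2) / np.sum(mask)  # average over the present targets only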

    Don’t hesitate to contact me if you need help or find any bug.


    • 16 Gabriel Bernier-Colborne April 1, 2013 at 15:29

      Thanks for providing these resources!

      When I tried training a model using the yaml file you provided, I was not able to load the dataset from /data/lisatmp/ift6266h13/ContestDataset. The error I got was:

      AttributeError: Could not evaluate keypoints_dataset.FacialKeypointDataset. Did you mean __package__? Original error was ‘module’ object has no attribute ‘FacialKeypointDataset’

      I’m thinking this might be caused by permissions, because when I try listing the directory /data/lisatmp/ift6266h13/ContestDataset/keypoints_dataset, I get a “Permission denied” error.

      Any ideas on how I can resolve this?

      • 17 Ian Goodfellow April 2, 2013 at 08:20

        This is fixed now. I’d given group read permissions to the files, but hadn’t noticed the group they belonged to was Yoshua’s research group. I chowned them to belong to the class group. Sorry about that.

  7. 18 Ian Goodfellow March 16, 2013 at 11:18

    When I presented maxout in class, Pierre-Luc asked if I’d ever tried training the universal approximator version of it. I said no, because while a network with 2 units and many pieces per unit can approximate any function, it might be hard to train such a network or to have it generalize well. I tried training it today just to see what would happen. I used 2 units and 600 pieces per unit. It got stuck at about 35% validation error on MNIST.
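
    For reference, a minimal sketch of how such a model might be written in pylearn2 (the layer sizes follow the description above; irange and the other details are placeholder guesses rather than the actual settings used):

    from pylearn2.models.mlp import MLP, Softmax
    from pylearn2.models.maxout import Maxout

    # Two maxout units with 600 linear pieces each, followed by a softmax
    # classifier: the "universal approximator" configuration discussed above.
    model = MLP(
        nvis=784,  # MNIST input dimension
        layers=[
            Maxout(layer_name='h0', num_units=2, num_pieces=600, irange=0.005),
            Softmax(layer_name='y', n_classes=10, irange=0.005),
        ],
    )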

  8. 19 Xavier Bouthillier March 14, 2013 at 07:26

    For those of you who use the Transformer module, I suggest that you drop the noisy transformations GaussianNoise, Sharpening and Denoizing. Try your best models again; you might get better results. I will post my results on the best transformations before the weekend.

  9. 21 Pierre Luc Carrier March 11, 2013 at 09:55

    If anyone is interested, I have added a BC01 implementation of Alex Krizhevsky's Local Response Normalization technique. You can see the details on my blog: http://plcift6266.wordpress.com/2013/03/10/bc01-local-response-normalization/
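
    For those who just want the idea, cross-channel local response normalization on a BC01 batch can be sketched in plain numpy roughly as below. This is not Pierre-Luc's implementation; the constants are the defaults reported by Krizhevsky et al. and are only illustrative:

    import numpy as np

    def local_response_normalization(x, n=5, k=2.0, alpha=1e-4, beta=0.75):
        """Cross-channel LRN on a batch in BC01 layout: (batch, channel, row, col)."""
        sq = x ** 2
        scale = np.zeros_like(x) + k
        num_channels = x.shape[1]
        half = n // 2
        for c in range(num_channels):
            lo, hi = max(0, c - half), min(num_channels, c + half + 1)
            # Sum of squares over the n neighbouring channels of channel c.
            scale[:, c] += alpha * sq[:, lo:hi].sum(axis=1)
        return x / scale ** beta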

  10. 22 Xavier Bouthillier March 5, 2013 at 22:27

    I merged my transformation script with the work of Pierre-Luc. It now contains 9 transformation functions: Translation, Scaling, Rotation, Horizontal flipping, Occlusion, Half face, Gaussian Noise, Sharpening and Denoizing.

    Take a look at my blog for more information (http://bouthilx.wordpress.com/2013/03/04/blow-up-the-training-set-but-not-my-ram/)
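
    To give an idea of what such transformations look like, here is a small numpy/scipy sketch of three of them (translation, rotation and horizontal flipping). It is independent of the merged script and only illustrative:

    import numpy as np
    from scipy.ndimage import rotate, shift

    def translate(img, dx, dy):
        """Shift a 2-D grayscale image by (dy, dx) pixels, filling borders with 0."""
        return shift(img, (dy, dx), mode='constant', cval=0.0)

    def horizontal_flip(img):
        """Mirror the image left-right."""
        return img[:, ::-1]

    def random_transform(img, rng):
        """Apply a small random translation and rotation, and maybe a flip."""
        out = translate(img, rng.randint(-3, 4), rng.randint(-3, 4))
        out = rotate(out, angle=rng.uniform(-10, 10), reshape=False, cval=0.0)
        if rng.rand() > 0.5:
            out = horizontal_flip(out)
        return out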

  11. 23 Ian Goodfellow March 1, 2013 at 10:45

    The /data/lisatmp and /data/lisa filesystems are offline today until Monday. The admins need to run fsck on the filesystems to look for data corruption caused by last week’s power outage. (We know that there was some for sure, but hopefully the RAID rebuild fixed it) This means you won’t be able to work on the contest using the lisa machines right now.

  12. 24 Pierre Luc Carrier February 25, 2013 at 20:22

    Hi all,
    I wrote some code to perform on-the-fly transformations on training data. The transformations implemented so far are translation, rotation and horizontal reflection. I have a small post explaining it on my blog (http://plcift6266.wordpress.com/) and the code is available on my github repository (https://github.com/carriepl/ift6266h13/tree/master/code/transformations).

    Feel free to use it if you think it might be useful to you. Simply look at the example.yaml file to see how to do it, and make sure that your PYTHONPATH environment variable includes the folder in which you put the 2 python scripts which perform the transformations.

  13. 25 Ian Goodfellow February 25, 2013 at 15:34

    I have updated the ContestDataset module, so if you have your own copy of it be sure to do an update. The latest change makes it possible to emit the data with different axes so you can use either theano or cuda_convnet for convolution. It also fixes a bug in make_submission.py where the submission script gave matrix inputs to models requesting 4-tensor inputs.

  14. 26 Caglar Gulcehre February 25, 2013 at 11:59

    Hi, I'm going to report my progress on this blog: http://caglarift6266.wordpress.com/. You can ignore the previous one.

  15. 27 Xavier Bouthillier February 24, 2013 at 13:33

    When we apply random transformations to images, should we seed the random function such that we always get the same transformations given the same setting?

    • 28 Ian Goodfellow February 24, 2013 at 13:38

      Yes, using a known seed is always a good idea. It doesn’t hurt anything (because you can always run the same script with several different seeds) and having a deterministic script makes it easier to reproduce your results, and to debug several kinds of problems.
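
      Concretely, a sketch of the pattern might look like this: one seeded generator shared by all the transformations, with the seed as the only thing you vary between runs.

      import numpy as np

      SEED = 1234                        # change this to rerun with different randomness
      rng = np.random.RandomState(SEED)  # one seeded generator for all transformations

      angle = rng.uniform(-10, 10)       # e.g. a random rotation angle
      do_flip = rng.rand() > 0.5         # e.g. whether to mirror the image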

  16. 29 Pierre Luc Carrier February 22, 2013 at 17:58

    I created a python script to enable easy comparison of the performance of a bunch of models. I find it particularly useful when launching a bunch of jobs on the cluster. There's a post on my blog describing its usage, and the code is on my github repository:

    My blog :
    http://plcift6266.wordpress.com/

    My code repository :
    https://github.com/carriepl/ift6266h13

    • 30 Xavier Bouthillier February 23, 2013 at 11:05

      I added the possibility to sort by multiple attributes at the same time. I find it pretty useful.

      Thank you for your script!

      • 31 Ian Goodfellow February 24, 2013 at 13:43

        This looks like a good script. Would one of you like to make a pull request to the main pylearn2 repository to add it to pylearn2/scripts?

  17. 32 Ian Goodfellow February 22, 2013 at 10:39

    Fred has set up databases for people that want to use jobman. (jobman should not be necessary since you are only using one computer cluster, but it can be helpful for analyzing hyperparameter search experiments)

    I can’t post the login information here, because this site is viewable by the public and we only want class students to be able to login. But I have created a file on the LISA filesystem, /data/lisatmp/ift6266h13/database.txt that gives the login information.

  18. 33 Xavier Bouthillier February 21, 2013 at 13:22

    I wrote a simple script to generate yaml files given a template file and a hyper-parameter configuration file.

    bouthilx.wordpress.com/2013/02/21/yaml-file-generator/
    https://github.com/bouthilx/ift6266kaggle/tree/master/gen_yaml
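
    The general idea behind such a generator can be sketched with Python's string.Template. The template markers, file names and hyper-parameter grid below are made up for illustration and are not the actual format of the generator:

    import itertools
    from string import Template

    template = Template(open("mlp_template.yaml").read())  # hypothetical template file

    # Hypothetical hyper-parameter grid; each combination yields one yaml file.
    grid = {
        "learning_rate": [0.01, 0.05, 0.1],
        "num_units": [500, 1000],
    }

    keys = sorted(grid)
    for i, values in enumerate(itertools.product(*(grid[k] for k in keys))):
        params = dict(zip(keys, values))
        with open("experiment_%02d.yaml" % i, "w") as f:
            f.write(template.substitute(params))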

  19. 35 Ian Goodfellow February 19, 2013 at 10:18

    A further comment about preprocessing:

    Pierre-Luc asked me why I made the default to be fit_preprocessor=False, since this seems like it disables learning-based preprocessors.

    The answer is that most of the time you run a yaml file, you are training on 80% of the training data, and validating on the remaining 20%. If the preprocessor fits itself in this context, you’ll end up with a preprocessor fit on 80% of the training data. What you probably want to have is a preprocessor fit on all of the training and all of the test data. To do this, you should have a separate script train the preprocessor and save it in a .pkl file. Then use

    preprocessor: !pkl: “my_preprocessor.pkl”

    to load it in your train experiment. The fit_preprocessor=False default makes sure that you don’t accidentally re-train the preprocessor and blow away everything it learned on the rest of the data.
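
    A minimal sketch of that separate fitting script, assuming the contest_dataset module and a ContestDataset class as the entry point (Standardize and serial.save are the real pylearn2 pieces; everything else is illustrative):

    from pylearn2.datasets.preprocessing import Standardize
    from pylearn2.utils import serial
    from contest_dataset import ContestDataset  # assumed import path

    # Load the full training set (no train/valid split here), fit the
    # preprocessor on it, then freeze it to disk for reuse.
    dataset = ContestDataset(which_set='train')
    preprocessor = Standardize()
    preprocessor.apply(dataset, can_fit=True)
    serial.save("my_preprocessor.pkl", preprocessor)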

  20. 36 Ian Goodfellow February 18, 2013 at 16:32

    If you are a Polytechnique student you have an IRO login, but you may not know it yet. If you don’t know your login you will need to see Bernard in Pavillon Andre-Aisenstadt 3221.

  21. 37 Ian Goodfellow February 18, 2013 at 16:30

    I’ve updated the ContestDataset repository to support two new arguments, “fit_preprocessor” and “fit_test_preprocessor”. If fit_preprocessor is true, the preprocessor can be fit while making that dataset. If fit_test_preprocessor is true, then when you call get_test_set on the first dataset, the preprocessor will be refit on the test set.

    I found that re-fitting on the test set worked better than keeping the preprocessing fixed between train and test, at least when using pylearn2.datasets.preprocessing.Standardize with my MLP setup.

  22. 38 zhaoyangyang February 17, 2013 at 22:59

    I get an error rate of 0.19 on my validation set, but my test score is 0.169. So I looked into the submission.csv and found that all the examples are classified as 2 or 5.

    A possible reason is that the test data loaded in make_submission.py is not normalized. But how can I tell make_submission.py to load the data from a pkl file?

    here is my make_dataset.py https://github.com/cccrystalyy/ift6266/blob/master/make_dataset.py

    • 39 Ian Goodfellow February 17, 2013 at 23:52

      If your preprocessing doesn’t involve any learning (i.e., if you just divide each example by its norm or something like that) then if you specify the “preprocessor” argument of ContestDataset in your yaml file for the training phase, everything should just work.

      make_submission.py loads the model, looks at model.dataset_yaml_src, and re-parses the yaml to obtain the dataset the model was trained on. It then calls dataset.get_test_set() to get the test set. ContestDataset.get_test_set will pass the same preprocessor argument to the test set constructor as was originally passed to the train set constructor.

      If your preprocessing does involve learning (i.e., if you compute the variance of a feature on the training set) then I admit the current interface doesn’t support that yet. Probably the easiest way to add that would be to edit the submission script to take a second argument specifying a yaml file to load. Change it to check the length of sys.argv and if there is a 3rd argument, use pylearn2.config.yaml_parse.load_path to get the dataset from that argument instead of calling pylearn2.config.yaml_parse.load on model.dataset_yaml_src as it does now.

      It looks like I’m currently at the top of the leaderboard without having used any learning in my preprocessing. My submission used

      preprocessor: !obj:pylearn2.datasets.preprocessing.GlobalContrastNormalization {}

      if you’d like to try that.

      • 40 zhaoyangyang February 18, 2013 at 00:22

        Thanks, but that's not what I'm asking… As I said above, I want to know how to tell make_submission.py to load the data from a pkl file instead of loading it from the model, because I don't want to retrain the model (too long…).

        Anyway I will try the preprocessor.

        • 41 Ian Goodfellow February 18, 2013 at 00:30

          I understood what you were asking. This was the part of my comment that explains how to load a different dataset:

          If your preprocessing does involve learning (i.e., if you compute the variance of a feature on the training set) then I admit the current interface doesn’t support that yet. Probably the easiest way to add that would be to edit the submission script to take a second argument specifying a yaml file to load. Change it to check the length of sys.argv and if there is a 3rd argument, use pylearn2.config.yaml_parse.load_path to get the dataset from that argument instead of calling pylearn2.config.yaml_parse.load on model.dataset_yaml_src as it does now.

          You can use the yaml file to load a .pkl file using the yaml syntax:

          !pkl: “/path/to/pkl/file.pkl”

          If you want you could do the edits I recommend, but call pylearn2.serial.load directly on argument 3 to make_submission.py and assume that argument will be a pkl file. That will accomplish what you’re trying to do but will be somewhat less general than making it load arbitrary yaml.
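
          Putting the two options together, the modified part of make_submission.py could look roughly like this (the argument order and variable names are guesses about the script's internals, not its actual code):

          import sys
          from pylearn2.utils import serial
          from pylearn2.config import yaml_parse

          # sys.argv[1]: trained model .pkl; sys.argv[2]: output .csv (assumed).
          model = serial.load(sys.argv[1])

          if len(sys.argv) > 3:
              # Optional 3rd argument: a .pkl dataset or a yaml file describing one,
              # e.g. a dataset whose preprocessor was fit on the full training set.
              arg = sys.argv[3]
              dataset = serial.load(arg) if arg.endswith('.pkl') else yaml_parse.load_path(arg)
          else:
              # Default behaviour: rebuild the dataset the model was trained on.
              dataset = yaml_parse.load(model.dataset_yaml_src)

          test_set = dataset.get_test_set()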

  23. 42 zhaoyangyang February 17, 2013 at 22:24

    I migrated my blog to WordPress:
    http://ift6266h13yangyang.wordpress.com/

  24. 43 Gabriel Bernier-Colborne February 14, 2013 at 14:11

    Regarding the contest, I have noticed a considerable difference between the misclassification rate I obtain on my validation set and the score I get on the test set when I submit my results on Kaggle (this score being based on classification accuracy). At the moment, my best models obtain a misclassification rate around 0.35 on the validation set, whereas my best score on the test set is 0.33282 in terms of classification accuracy. I am simply wondering if anyone else has had similar results.

    • 44 Ian Goodfellow February 14, 2013 at 14:18

      Yes, the test dataset is much harder than training data. The training data is based on manually aligned images of people who were paid to act out each emotion and photographed under careful conditions. The test dataset is based on Google image search results for different emotion keywords, and pictures of LISA lab members taken under a wide range of conditions with different webcams.

      I actually got a validation set score of over 80% accuracy but my leaderboard score is only 36%.

  25. 46 Vincent Archambault-B February 13, 2013 at 22:49

    My blog is at: http://archambaultv.com

    My repository for all code related to this course is: https://github.com/archambaultv/IFT6266

  26. 48 zhaoyangyang February 13, 2013 at 18:20

    I have this error message while using contest data:

    Forcing DISTUTILS_USE_SDK=1
    Building dataset from yaml…
    …done
    norms of examples:
    min: 2626.38192196
    mean: 5932.80437406
    max: 11492.1476235
    range of elements of examples (0.0, 255.0)
    dtype: float64
    (48L, 48L)
    Traceback (most recent call last):
    File “H:\GitHubCode\pylearn2\pylearn2\scripts\show_examples.py”, line 110, in
    pv = patch_viewer.PatchViewer( (rows, cols), examples.shape[1:3], is_color = is_color)
    File “H:\Python27\lib\site-packages\pylearn2-0.1dev-py2.7.egg\pylearn2\gui\patch_viewer.py”, line 62, in __init__
    assert isinstance(elem,int)
    AssertionError

    • 49 Ian Goodfellow February 13, 2013 at 18:30

      Update pylearn2. I’ve improved the error message to give some more information. I’m guessing numpy behaves differently on Windows somehow.

      • 50 zhaoyangyang February 13, 2013 at 18:33

        New error :

        Forcing DISTUTILS_USE_SDK=1
        Building dataset from yaml…
        …done
        norms of examples:
        min: 2626.38192196
        mean: 5932.80437406
        max: 11492.1476235
        range of elements of examples (0.0, 255.0)
        dtype: float64
        (48L, 48L)
        Traceback (most recent call last):
        File “H:\GitHubCode\pylearn2\pylearn2\scripts\show_examples.py”, line 110, in
        pv = patch_viewer.PatchViewer( (rows, cols), examples.shape[1:3], is_color = is_color)
        File “h:\githubcode\pylearn2\pylearn2\gui\patch_viewer.py”, line 63, in __init__
        raise ValueError(“Expected grid_shape and patch_shape to be pairs of ints, but they are %s and %s respectively.” % (
        str(grid_shape), str(patch_shape)))
        ValueError: Expected grid_shape and patch_shape to be pairs of ints, but they are (20, 20) and (48L, 48L) respectively.

        • 51 Ian Goodfellow February 13, 2013 at 18:37

          OK, it looks like on your installation, numpy sometimes specifies shapes as python longs, which I’ve never even heard of before. I changed the check to allow longs, so update pylearn2 again.

  27. 52 Sina Honari February 13, 2013 at 16:17

    Running any of the three pylearn2 tutorials mentioned on this page (https://github.com/lisa-lab/pylearn2/tree/master/tutorials), including multilayer_perceptron.ipynb, I get the following message:

      “An error occurred while loading this notebook. Most likely this notebook is in a newer format than is supported by this version of IPython. This version can load notebook formats v3 or earlier.”

    Ian and David checked it out but couldn't figure out the problem. Fred has been notified of the issue. Hopefully he knows how to resolve it.

  28. 54 Ian Goodfellow February 12, 2013 at 17:02

    The way the contest setup works, the public test dataset does not include any labels. You can’t use this as a validation set during training to do things like early stopping. To do that you must make your own validation set from a subset of the training set.

    I’ve updated the ContestDataset repository to include “start” and “stop” parameters that you can use to load a subset of the dataset. You can use this to make your own validation set.
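
    For example, assuming the contest_dataset module and a ContestDataset class with the start/stop interface described above (the split boundaries below are illustrative and should be adjusted to the actual dataset size):

    from contest_dataset import ContestDataset  # assumed import path

    # Use the first part of the labelled data for training and hold out the
    # rest as a validation set (boundaries are illustrative).
    train = ContestDataset(which_set='train', start=0, stop=3500)
    valid = ContestDataset(which_set='train', start=3500, stop=4000)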

    To see how well you’re doing on the contest, you can run the new make_submission.py script on one of your trained models. This will make a .csv file containing your model’s estimate of the correct labels on the public test set. You can then upload this .csv file to the kaggle site to find out how well you scored.

  29. 55 Ian Goodfellow February 12, 2013 at 14:01

    When using the LISA computers, you should add the following to your ~/.bashrc file:

    if [ -e “/opt/lisa/os/.local.bashrc” ];then
    source /opt/lisa/os/.local.bashrc
    elif [ -e /data/lisa/data/local_export/.local.bashrc ];then
    source /data/lisa/data/local_export/.local.bashrc
    fi

  30. 56 Ian Goodfellow February 12, 2013 at 12:42

    If your ipython is out of date, Fred recommends that you upgrade to 0.13 rather than trying to keep using the old version. ipython notebooks have been enhanced a lot in the latest version.

    You can get version 0.13 by running:
    pip install -U ipython

    If you don’t have pip yet, you can install it with apt-get.

  31. 58 Ian Goodfellow February 11, 2013 at 15:54

    To use a computer at LISA, Fred says to use maggie46. You can log in remotely via elisa1@iro.umontreal.ca then ssh further to maggie46.

  32. 62 Ian Goodfellow February 10, 2013 at 14:16

    I’ve created a python module for accessing the Kaggle dataset. On a DIRO computer, just add
    /data/lisatmp/ift6266h13/ContestDataset/
    to your PYTHONPATH environment variable. This will allow you to import the contest_dataset module.
    On the DIRO machines, there is no need for you to download the Kaggle dataset; you can just use the files I downloaded.

    If you want to use a different setup, such as your own machine, you can get the python module by cloning my github repository:
    https://github.com/goodfeli/ContestDataset

  33. 63 zhaoyangyang February 10, 2013 at 13:06

    Different versions of the IPython notebook are not compatible with each other; this could cause some problems.

    • 64 Ian Goodfellow February 10, 2013 at 13:14

      If you’re using an outdated version of ipython, you won’t be able to read ipython notebooks made using the latest version. But people with the latest version of ipython will be able to read yours.

  34. 68 Ian Goodfellow February 4, 2013 at 15:58

    Each student should make a research journal that can be shared with the class.

    You can do this in one of two ways: either with an ipython notebook or a wordpress blog. Both support using images and latex equations. ipython also supports executable python code. Feel free to choose either method.

    WordPress instructions:
    1. Go to http://wordpress.com/#!/my-blogs/
    2. Click “Create a New Blog” and follow the instructions
    3. Post the link to your blog here
    4. Read http://en.support.wordpress.com/latex/ to see how to post latex on your blog.

    ipython notebook instructions:
    1. Create an account on github.com
    2. Click “Create a new repository”
    3. Follow the instructions. Check “Initialize this repository with a README”
    4. In a bash shell, use “git clone” to check out your new git repository.
    5. From inside the repository, use “ipython notebook” to open the ipython notebook editor. Save a notebook to the repository.
    6. Use git to push the notebook back to your github account.
    7. Post a link to your github repository here so we can follow your updates.
    8. Check out some example notebooks to get an idea of the syntax: http://nbviewer.ipython.org/


  1. 1 Shuffling the Dataset | Experiments in Representation Learning Trackback on April 23, 2013 at 08:02
  2. 2 Standardization | Experiments in Representation Learning Trackback on February 26, 2013 at 12:03
  3. 3 Architecture and Contrast Normalization | Experiments in Representation Learning Trackback on February 20, 2013 at 15:49
  4. 4 Architecture and Adaptive Learning Rate | Experiments in Representation Learning Trackback on February 15, 2013 at 01:35
