Friday 25 July 2014

Things to try after useR! - Part 1: Deep Learning with H2O



Annual R User Conference 2014

The useR! 2014 conference was a mind-blowing experience. With hundreds of R enthusiasts and the beautiful UCLA campus, I am really glad that I had the chance to attend! The only problem is that, after a few days of non-stop R talks, I was (and still am) completely overwhelmed by all the cool new packages and ideas.

Let me start with H2O - one of the three promising projects that John Chambers highlighted during his keynote (the other two were Rcpp/Rcpp11 and RLLVM/RLLVMCompile).

What's H2O?

"The Open Source In-Memory, Prediction Engine for Big Data Science" - that's how 0xdata, the creator of H2O, describes it. Joseph Rickert's blog post is a very good introduction to H2O, so please read that if you want to find out more. I am going straight into the deep learning part.

Deep Learning in R

Deep learning tools in R are still relatively rare at the moment compared to other popular algorithms like Random Forest and Support Vector Machines. A nice article about deep learning can be found here. Before discovering H2O, my deep learning coding experience was mostly in Matlab with the DeepLearnToolbox. Recently, I have started using 'deepnet', 'darch' as well as my own code for deep learning in R. I have even started developing a new package called 'deepr' to further streamline the procedures. Now that I have discovered the package 'h2o', I may well shift the design focus of 'deepr' to further integration with H2O instead!

But first, let's play with the 'h2o' package and get familiar with it.

The H2O Experiment

The main purpose of this experiment is to get myself familiar with the 'h2o' package. There are quite a few machine learning algorithms that come with H2O (such as Random Forest and GBM). But I am only interested in the Deep Learning part and the H2O cluster configuration right now. So the following experiment was set up to investigate:
  1. How to set up and connect to a local H2O cluster from R.
  2. How to train a deep neural network model.
  3. How to use the model for predictions.
  4. Out-of-bag performance of non-regularized and regularized models.
  5. How the memory usage varies over time.

Experiment 1: 

For the first experiment, I used the Wisconsin Breast Cancer Database. It is a very small dataset (699 samples of 10 features and 1 label) so that I could carry out multiple runs to see the variation in prediction performance. The main purpose is to investigate the impact of model regularization by tuning the 'Dropout' parameter in the h2o.deeplearning(...) function (or basically the objectives 1 to 4 mentioned above).

Experiment 2: 

The next thing to investigate is the memory usage (objective 5). For this purpose, I chose a bigger (but still small by today's standards) dataset, the MNIST Handwritten Digits Database (LeCun et al.). I would like to find out if the memory usage can be capped at a defined allowance over a long period of model training.

Findings

OK, enough of the background and experiment setup. Instead of writing this blog post like a boring lab report, let's go through what I have found out so far. (If you want to find out more, all code is available here so you can modify it and try it out on your clusters.)

Setting Up and Connecting to an H2O Cluster

Smoooooth! - if I have to explain it in one word. 0xdata made this really easy for R users. Below is the code to start a local cluster with 1GB or 2GB memory allowance. However, if you want to start the local cluster from the terminal (which is also useful if you want to see the messages during model training), you can run: java -Xmx1g -jar h2o.jar (see the original H2O documentation here).

By default, H2O starts a cluster using all available threads (8 in my case). The h2o.init(...) function has no argument for limiting the number of threads yet (well, sometimes you do want to leave one thread idle for other important tasks like Facebook). But it is not really a problem.
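
Here is a minimal sketch of starting (or connecting to) a capped local cluster from R. It assumes a recent release of the 'h2o' package, where h2o.init(...) accepts max_mem_size and nthreads; newer releases did add a thread limit, and argument names in older releases may differ. The helper name start_local_h2o is hypothetical, not something that ships with the package.

```r
# Hypothetical helper, not part of the h2o package: start (or connect to)
# a local H2O cluster with a capped memory allowance.
# Assumes a recent h2o release where h2o.init() accepts these arguments.
start_local_h2o <- function(max_mem = "1g", threads = -1, start = TRUE) {
  if (!requireNamespace("h2o", quietly = TRUE)) {
    stop("Please install the 'h2o' package first.")
  }
  h2o::h2o.init(
    ip = "localhost",
    port = 54321,
    startH2O = start,        # FALSE: only connect to a cluster started from the terminal
    max_mem_size = max_mem,  # e.g. "1g" or "2g"
    nthreads = threads       # -1 means use all available threads
  )
}

# Usage (needs Java and the h2o package):
# localH2O <- start_local_h2o(max_mem = "2g")
```

Setting start = FALSE should be the variant to use after launching the jar from the terminal, since the cluster is then already running and R only needs to connect to it.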

Loading Data

In order to train models with the H2O engine, I need to link the datasets to the H2O cluster first. There are many ways to do it. In this case, I linked a data frame (Breast Cancer) and imported CSVs (MNIST) using the following code.
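
A rough sketch of the two routes, assuming a running cluster and a recent 'h2o' release where as.h2o(...) and h2o.importFile(...) no longer need the connection object as their first argument. The helper names, the 'mlbench' source for the Breast Cancer data and the CSV file name are all assumptions for illustration.

```r
# Hypothetical helpers (names are illustrative), assuming a running H2O
# cluster and a recent h2o release.

# Push an in-memory R data frame to the cluster.
link_data_frame <- function(df) {
  stopifnot(is.data.frame(df))
  h2o::as.h2o(df)
}

# Import a CSV file straight into the cluster without loading it into R first.
import_csv <- function(path) {
  stopifnot(file.exists(path))
  h2o::h2o.importFile(path = normalizePath(path))
}

# Usage, assuming the mlbench copy of the data and a placeholder file name:
# library(mlbench); data(BreastCancer)
# cancer_hex <- link_data_frame(BreastCancer[, -1])  # drop the Id column
# mnist_hex  <- import_csv("mnist_train.csv")
```

Both helpers return a reference to a frame that lives on the cluster, not a copy of the data in R.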


Training a Deep Neural Network Model

The syntax is very similar to other machine learning algorithms in R. The key difference is the inputs for x and y, for which you need to use the column numbers as identifiers.
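
A minimal sketch of a dropout-regularized model, assuming a recent 'h2o' release (where the data argument is called training_frame) and a frame already linked to the cluster. The wrapper name, layer sizes and dropout ratios are illustrative assumptions, not the settings behind the results in this post.

```r
# Hypothetical wrapper: train a dropout-regularized deep neural network.
# Assumes a recent h2o release and an H2OFrame already on the cluster.
# For classification the response column should be a factor.
train_dropout_net <- function(train_hex, x_cols, y_col,
                              hidden = c(50, 50),
                              input_dropout = 0.2,
                              hidden_dropout = 0.5,
                              epochs = 100) {
  h2o::h2o.deeplearning(
    x = x_cols,                      # predictor column numbers
    y = y_col,                       # response column number
    training_frame = train_hex,
    activation = "TanhWithDropout",  # a dropout-enabled activation
    hidden = hidden,                 # one entry per hidden layer
    input_dropout_ratio = input_dropout,
    hidden_dropout_ratios = rep(hidden_dropout, length(hidden)),
    epochs = epochs
  )
}

# Usage: model <- train_dropout_net(cancer_hex, x_cols = 1:9, y_col = 10)
```

Switching the activation to plain "Tanh" and dropping the two dropout arguments would give the non-regularized counterpart for comparison.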


Using the Model for Prediction

Again, the code should look very familiar to R users.


The h2o.predict(...) function will return the predicted label with the probabilities of all possible outcomes (or numeric outputs for regression problems) - very useful if you want to train more models and build an ensemble.
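
A small sketch of scoring new data and pulling the results back into R, assuming a trained model and a test frame on the cluster. The helper name predict_to_df is hypothetical.

```r
# Hypothetical helper: score new data and bring the results back into R.
predict_to_df <- function(model, new_hex) {
  pred_hex <- h2o::h2o.predict(model, new_hex)  # result stays on the H2O cluster
  as.data.frame(pred_hex)                       # copy it back as an R data frame
}

# Usage: pred <- predict_to_df(model, test_hex)
```

For a classification model, the first column of the returned data frame should hold the predicted label and the remaining columns the class probabilities, which is the part you would feed into an ensemble.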

Out-of-Bag Performance (Breast Cancer Dataset)



No surprise here. As I expected, the non-regularized model overfitted the training set and performed poorly on the test set. Also as expected, the regularized models did give consistent out-of-bag performance. Of course, more tests on different datasets are needed. But this is definitely a good start for using deep learning techniques in R!
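
To compare models over several runs, each run needs a fresh random split and a score on the rows the model never saw. Here is a minimal base R sketch of those two pieces. The 60/40 split, the seed and the function names are assumptions for illustration, not necessarily the settings I used.

```r
# Base R only; function names are illustrative.

# Random train/test split of row indices.
split_rows <- function(n, train_frac = 0.6) {
  train <- sample(n, size = floor(train_frac * n))
  list(train = train, test = setdiff(seq_len(n), train))
}

# Share of correctly predicted labels.
accuracy <- function(actual, predicted) {
  mean(as.character(actual) == as.character(predicted))
}

set.seed(2014)
idx <- split_rows(699)  # 699 samples in the Breast Cancer dataset
length(idx$train)       # 419 rows for training
length(idx$test)        # 280 rows held out

# Two of these three toy predictions are correct.
accuracy(c("benign", "malignant", "benign"),
         c("benign", "malignant", "malignant"))
```

Calling split_rows(...) again inside a loop, without resetting the seed, gives a different split for each run, which is what makes the spread in test accuracy visible.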

Memory Usage (MNIST Dataset)



This is awesome and really encouraging! In near idle mode, my laptop uses about 1GB of memory (Ubuntu 14.04). During the MNIST model training, H2O successfully kept the memory usage below the capped 2GB allowance over time with all 8 threads working like a steam train! OK, this is based on just one simple test but I already feel comfortable and confident to move on and use H2O for much bigger datasets.

Conclusions

OK, let's start from the only negative point. The machine learning algorithms are limited to the ones that come with H2O. I cannot leverage the power of other available algorithms in R yet (correct me if I am wrong. I will be very happy to be proven wrong this time. Please leave a comment on this blog so everyone can see it). Therefore, in terms of model choices, it is not as handy as caret and subsemble.

Having said that, the included algorithms (Deep Neural Networks, Random Forest, GBM, K-Means, PCA etc.) are solid for most of the common data mining tasks. Discovering and experimenting with the deep learning functions in H2O really made me happy. With the superb memory management and the full integration with multi-node big data platforms, I am sure this H2O engine will become more and more popular among data scientists. I am already thinking about the Parallella project but I will leave it until I finish my thesis.

I can now understand why John Chambers recommended H2O. It has already become one of my essential R tools for data mining. The deep learning algorithm in H2O is very interesting, and I will continue to explore and experiment with the rest of the regularization parameters such as 'L1' and 'L2', as well as the 'Maxout' activation function.

Code

As usual, code is available at my GitHub repo for this blog.

Personal Highlight of useR! 2014

Just a bit more on useR! During the conference week, I met so many cool R people for the very first time. You can see some of the photos by searching #user2014 and my twitter handle together. Other blog posts about the conference can be found here, here, here, here, here and here. For me, the highlight has to be this text analysis by Ajay:
... which means I successfully made Matlab trending with R!!! 

During the conference banquet, Jeremy Achin (from DataRobot) suggested that I might as well change my profile photo to a Python logo just to make it even more confusing! It was also very nice to speak to Matt Dowle in person and to learn about his amazing data.table journey from S to R. I have started updating some of my old code to use data.table for the heavy data wrangling tasks.

By the way, Jeremy and the DataRobot team (a dream team of top Kaggle data scientists including Xavier who gave a talk about "10 packages to Win Kaggle Competitions") showed me an amazing demo of their product. Do ask them for a beta account and see for yourself!!!

There are more cool things that I am trying at the moment. I will try to blog about them in the near future. If I have to name a few right now ... that will be:

(Pheeew! So here is my first blog post related to machine learning - the very purpose of starting this blog. Not bad it finally happened after a whole year!)
