Annual R User Conference 2014
The useR! 2014 conference was a mind-blowing experience. With hundreds of R enthusiasts and the beautiful UCLA campus, I am really glad that I had the chance to attend! The only problem is that, after a few days of non-stop R talks, I was (and still am) completely overwhelmed with the new cool packages and ideas.
Let me start with H2O - one of the three promising projects that John Chambers highlighted during his keynote (the other two were Rcpp/Rcpp11 and RLLVM/RLLVMCompile).
What's H2O?
"The Open Source In-Memory, Prediction Engine for Big Data Science" - that's what Oxdata, the creator of H2O, said. Joseph Rickert's blog post is a very good introduction to H2O so please read that if you want to find out more. I am going straight into the deep learning part.
Deep Learning in R
Deep learning tools in R are still relatively rare at the moment when compared to other popular algorithms like Random Forest and Support Vector Machines. A nice article about deep learning can be found here.
Before the discovery of H2O, my deep learning coding experience was mostly in Matlab with the DeepLearnToolbox. Recently, I have started using 'deepnet', 'darch' as well as my own code for deep learning in R. I have even started developing a new package called 'deepr' to further streamline the procedures. Now that I have discovered the package 'h2o', I may well shift the design focus of 'deepr' to further integration with H2O instead!
But first, let's play with the 'h2o' package and get familiar with it.
The H2O Experiment
The main purpose of this experiment is to get myself familiar with the 'h2o' package. There are quite a few machine learning algorithms that come with H2O (such as Random Forest and GBM). But I am only interested in the Deep Learning part and the H2O cluster configuration right now. So the following experiment was set up to investigate:
- How to set up and connect to a local H2O cluster from R.
- How to train a deep neural network model.
- How to use the model for predictions.
- Out-of-bag performance of non-regularized and regularized models.
- How memory usage varies over time.
Experiment 1:
For the first experiment, I used the Wisconsin Breast Cancer Database. It is a very small dataset (699 samples of 10 features and 1 label) so that I could carry out multiple runs to see the variation in prediction performance. The main purpose is to investigate the impact of model regularization by tuning the 'Dropout' parameter in the h2o.deeplearning(...) function (or basically the objectives 1 to 4 mentioned above).
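The "multiple runs" idea can be sketched in plain R as a set of repeated random train/test splits. This is only an illustration: the helper name make_splits, the number of runs, the 70/30 ratio and the seed are all assumptions of mine, not the settings used in the experiment.

```r
# Hypothetical helper (not from the original code): row indices for
# repeated random train/test splits of a dataset with n rows.
make_splits <- function(n, n_runs = 10, train_frac = 0.7, seed = 1234) {
  set.seed(seed)
  lapply(seq_len(n_runs), function(i) {
    train <- sort(sample.int(n, size = floor(train_frac * n)))
    list(train = train, test = setdiff(seq_len(n), train))
  })
}

# 699 is the number of samples in the Breast Cancer data.
splits <- make_splits(n = 699)

# Each element holds disjoint train and test indices for one run.
lengths(splits[[1]])
```

Training one model per split and collecting the test accuracy of each run gives a rough picture of how much the prediction performance varies.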
Experiment 2:
The next thing to investigate is the memory usage (objective 5). For this purpose, I chose a bigger (but still small by today's standards) dataset, the MNIST Handwritten Digits Database (LeCun et al.). I would like to find out whether the memory usage can be capped at a defined allowance over a long period of model training.
Findings
OK, enough for the background and experiment setup. Instead of writing this blog post like a boring lab report, let's go through what I have found out so far. (If you want to find out more, all code is available here so you can modify it and try it out on your clusters.)
Setting Up and Connecting to a H2O Cluster
Smoooooth! - if I have to explain it in one word. Oxdata made this really easy for R users. Below is the code to start a local cluster with 1GB or 2GB memory allowance. However, if you want to start the local cluster from the terminal (which is also useful if you want to see the messages during model training), you can run java -Xmx1g -jar h2o.jar (see the original H2O documentation here).
By default, H2O starts a cluster using all available threads (8 in my case). The h2o.init(...) function has no argument for limiting the number of threads yet (well, sometimes you do want to leave one thread idle for other important tasks like Facebook). But it is not really a problem.
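A minimal sketch of starting the local cluster from R, assuming a recent release of the 'h2o' package. The argument name max_mem_size comes from recent releases and may differ in the version available at the time of writing, so treat the exact call as an assumption and check ?h2o.init for your installed version. Recent releases also appear to accept an nthreads argument for limiting the thread count.

```r
library(h2o)

# Start (or connect to) a local H2O cluster with a 1 GB memory allowance.
# Assumption: `max_mem_size` is the argument name in recent h2o releases.
h2o.init(ip = "localhost", port = 54321, max_mem_size = "1g")

# For a 2 GB allowance, shut the cluster down and start it again:
# h2o.shutdown(prompt = FALSE)
# h2o.init(max_mem_size = "2g")
```

The memory cap is fixed when the cluster starts, which is why changing it means restarting the cluster rather than changing a setting on a running one.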
Loading Data
In order to train models with the H2O engine, I need to link the datasets to the H2O cluster first. There are many ways to do it. In this case, I linked a data frame (Breast Cancer) and imported CSVs (MNIST) using the following code.
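The two loading routes mentioned above might look like the sketch below. It assumes a recent 'h2o' release, takes the Breast Cancer data from the 'mlbench' package (my assumption about a convenient source) and uses a hypothetical CSV path for MNIST.

```r
library(h2o)
library(mlbench)  # assumption: used here only as a source of the data

h2o.init(max_mem_size = "1g")

# Route 1: link an in-memory R data frame to the H2O cluster.
data(BreastCancer)
bc <- BreastCancer[, -1]   # drop the sample Id column
bc_hex <- as.h2o(bc)
dim(bc_hex)

# Route 2: import a CSV file directly into the cluster.
# The path is hypothetical; point it at your own copy of the MNIST data.
# mnist_hex <- h2o.importFile(path = "data/mnist_train.csv")
```

Importing the CSV directly means the larger MNIST data does not have to pass through an R data frame first.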
Training a Deep Neural Network Model
The syntax is very similar to other machine learning algorithms in R. The key difference is that the inputs x and y are specified by column number.
Using the Model for Prediction
Again, the code should look very familiar to R users.
The h2o.predict(...) function will return the predicted label with the probabilities of all possible outcomes (or numeric outputs for regression problems) - very useful if you want to train more models and build an ensemble.
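The training and prediction steps described above can be sketched together as follows. This assumes a recent 'h2o' release and the 'mlbench' copy of the Breast Cancer data. The network size, number of epochs, dropout ratios and split ratio are illustrative assumptions, not the settings used in the experiments.

```r
library(h2o)
library(mlbench)  # assumption: used here only as a source of the data

h2o.init(max_mem_size = "1g")

data(BreastCancer)
bc_hex <- as.h2o(BreastCancer[, -1])   # drop Id: 9 features + Class

# Hold out part of the data for prediction (the ratio is an assumption).
parts     <- h2o.splitFrame(bc_hex, ratios = 0.7, seed = 1234)
train_hex <- parts[[1]]
test_hex  <- parts[[2]]

# x and y are given as column numbers.
model <- h2o.deeplearning(
  x = 1:9,
  y = 10,
  training_frame = train_hex,
  activation = "TanhWithDropout",
  input_dropout_ratio = 0.2,
  hidden_dropout_ratios = c(0.5, 0.5),
  hidden = c(50, 50),
  epochs = 100
)

# One row per test sample: predicted label plus one probability per class.
pred <- h2o.predict(model, newdata = test_hex)
head(as.data.frame(pred))
```

Converting the predictions back with as.data.frame(...) makes it easy to combine the class probabilities with outputs from other models when building an ensemble.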
Out-of-Bag Performance (Breast Cancer Dataset)
No surprise here. As I expected, the non-regularized model overfitted the training set and performed poorly on test set. Also as expected, the regularized models did give consistent out-of-bag performance. Of course, more tests on different datasets are needed. But this is definitely a good start for using deep learning techniques in R!
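For readers who want to try a similar comparison, here is a rough sketch that trains one model without dropout and one with dropout, then reports accuracy on the training and test rows. It assumes a recent 'h2o' release and the 'mlbench' data. The helpers accuracy and fit are hypothetical names of mine, and all hyperparameters are assumptions rather than the settings behind the results above.

```r
library(h2o)
library(mlbench)  # assumption: used here only as a source of the data

h2o.init(max_mem_size = "1g")

data(BreastCancer)
parts <- h2o.splitFrame(as.h2o(BreastCancer[, -1]), ratios = 0.7, seed = 1234)
train_hex <- parts[[1]]
test_hex  <- parts[[2]]

# Hypothetical helper: share of rows whose predicted label is correct.
accuracy <- function(model, frame) {
  pred   <- as.data.frame(h2o.predict(model, newdata = frame))$predict
  actual <- as.data.frame(frame)$Class
  mean(as.character(pred) == as.character(actual))
}

# Hypothetical helper: same network, different regularization settings.
fit <- function(activation, ...) {
  h2o.deeplearning(x = 1:9, y = 10, training_frame = train_hex,
                   activation = activation, hidden = c(50, 50),
                   epochs = 100, seed = 1234, ...)
}

plain   <- fit("Tanh")
dropout <- fit("TanhWithDropout", input_dropout_ratio = 0.2,
               hidden_dropout_ratios = c(0.5, 0.5))

results <- data.frame(
  model = c("no dropout", "dropout"),
  train = c(accuracy(plain, train_hex), accuracy(dropout, train_hex)),
  test  = c(accuracy(plain, test_hex),  accuracy(dropout, test_hex))
)
results
```

A large gap between the train and test columns for a model is the overfitting signal to look for. A single split like this is noisy, so repeating it over several random splits gives a fairer picture.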
Memory Usage (MNIST Dataset)
This is awesome and really encouraging! In near idle mode, my laptop uses about 1GB of memory (Ubuntu 14.04). During the MNIST model training, H2O successfully kept the memory usage below the capped 2GB allowance over time with all 8 threads working like a steam train! OK, this is based on just one simple test but I already feel comfortable and confident to move on and use H2O for much bigger datasets.
Conclusions
OK, let's start from the only negative point. The machine learning algorithms are limited to the ones that come with H2O. I cannot leverage the power of other available algorithms in R yet (correct me if I am wrong. I will be very happy to be proven wrong this time. Please leave a comment on this blog so everyone can see it). Therefore, in terms of model choices, it is not as handy as caret and subsemble.
Having said that, the included algorithms (Deep Neural Networks, Random Forest, GBM, K-Means, PCA etc) are solid for most of the common data mining tasks. Discovering and experimenting with the deep learning functions in H2O really made me happy. With the superb memory management and the full integration with multi-node big data platforms, I am sure this H2O engine will become more and more popular among data scientists. I am already thinking about the Parallella project but I will leave it until I finish my thesis.
I can now understand why John Chambers recommended H2O. It has already become one of my essential R tools for data mining. The deep learning algorithm in H2O is very interesting, and I will continue to explore and experiment with the rest of the regularization parameters such as 'L1', 'L2' and 'Maxout'.
Code
As usual, code is available at my GitHub repo for this blog.
Personal Highlight of useR! 2014
Just a bit more on useR! During the conference week, I met so many cool R people for the very first time. You can see some of the photos by searching #user2014 and my twitter handle together. Other blog posts about the conference can be found here, here, here, here, here and here. For me, the highlight has to be this text analysis by Ajay:
#User2014 trended thx to: @LouBajuk @guneetc79 @earino @pilatesbuff @matlabulous @timtriche http://t.co/auoFM1xWIw pic.twitter.com/l952WD5ejz
— Ajay Gopal (@aj2z) July 7, 2014
... which means I successfully made Matlab trending with R!!!
During the conference banquet, Jeremy Achin (from DataRobot) suggested that I might as well change my profile photo to a Python logo just to make it even more confusing! It was also very nice to speak to Matt Dowle in person and to learn about his amazing data.table journey from S to R. I have started updating some of my old code to use data.table for the heavy data wrangling tasks.
By the way, Jeremy and the DataRobot team (a dream team of top Kaggle data scientists including Xavier who gave a talk about "10 packages to Win Kaggle Competitions") showed me an amazing demo of their product. Do ask them for a beta account and see for yourself!!!
There are more cool things that I am trying at the moment. I will try to blog about them in the near future. If I have to name a few right now ... that will be:
- Embedding Shiny Apps in R Markdown by RStudio
- subsemble: Ensemble learning in R with the Subsemble algorithm by Erin LeDell
- OpenCPU by Jeroen Ooms
- dendextend: an R package for easier manipulation and visualization of dendrograms by Tal Galili
- Adaptive Resampling in a Parallel World by Max Kuhn
- Packrat - A Dependency Management System for R by J.J. Allaire
(Pheeew! So here is my first blog post related to machine learning - the very purpose of starting this blog. Not bad it finally happened after a whole year!)