Python has emerged over the last couple decades as a first-class tool for scientific computing tasks, including the analysis and visualization of large datasets. This may have come as a surprise to early proponents of the Python language: the language itself was not specifically designed with data analysis or scientific computing in mind. The usefulness of Python for data science stems primarily from the large and active ecosystem of third-party packages, particularly:
- NumPy for manipulation of homogeneous array-based data
- Pandas for manipulation of heterogeneous and labeled data
- SciPy for common scientific computing tasks, Matplotlib for publication-quality visualizations
- IPython for interactive execution and sharing of code
- Scikit-Learn for machine learning
How much Python should I know?
As with any other (programming) language, it takes years to master it fluently which is beyond the scope this course. Instead, our objective is to have a working knowledge of Python to be able to learn and apply machine learning. To make this explicit we take the following book and online resources as our point of reference. Prior to starting the Data Science Foundation programme, participants are expected to have mastered the following topics:
- A Whirlwind Tour of Python (pages number from the pdf version):
- Know how to install and use Python on your own computer (pages 1 to 13)
- Know basic semantics of variables, objects and operators (pages 13 to 24)
- Know built-in simple values and data structures (pages 24 to 37)
- Know how to use control flow and functions (pages 37 to 45)
- Know how to iterate and use list comprehensions (pages 52 to 61)
- Python Data Science handbook
- Know how to manipulate data with pandas (chapter 3)
The learning path proposed here is similar to the PCEP™ – Certified Entry-Level Python Programmer certification. The PCEP™ certification is a good way to assess your current Python knowledge and to prepare for the Machine Learning Foundation course. The certification is offered by the Python Institute. You may opt to obtain this certificate.
How should I learn Python?
This foundational course aims to cater for participants both with and without programming experience. To ensure everyone is able to acquire the level described above, participants are given access to Real Python and are required to obtain a working knowledge of Python through self-study prior to starting the lectures.
If you have never done any scientific programming before, you can prepare as follows:
- Complete the Python Basics with Real Python.
- Browse the sections from A Whirlwind Tour of Python and Python Data Science Handbook as listed in Section 1 to verify you have understood te basics.
If you have some experience in scientific programming, for example in R or Matlab, you jump into Python as follows:
- read A Whirlwind Tour of Python and chapter 3 of the Python Data Science Handbook
- brush up skills on Real Python on specific topics.
Which Python environment should I use?
Options how to start using Python are listed below. At the very least, make sure you know how to use Deepnote (option 1 listed below). This is the default platform which will be used in class.
Online environment
For those new to Python, it is probably easiest to start with one of these online tools:
- Deepnote: there is a generous free-tier. If you decide to upgrade, you can collaborate and share notebooks privately.
- Google Colab:
- Activate a Google account if you haven’t got one yet. This how-to guide explains how you can link an existing (work) email address to a Google account.
- Work your way through the Colab introduction notebook
Once you have gained some traction, you can move on to install Python on your local machine.
Local environment
You can setup your local machine/laptop for data science and machine learning as a follows
- Install Anaconda (recommended package manager for data science).
- Install Visual Studio Code including the Python extension (recommended integrated development environment (IDE)).
Guidelines for using Python for data science
Using Python for data science is inherently different than using it for, say, building a website. To provide you with some guidance to the many different ways c.q. styles of using Python, please consider the following:
- Focus on using existing data science libraries, instead of writing your own basic functions. If you find yourself spending a lot of time reading documentation, you are on the right track.
- Take a functional approach to programming instead of an object-oriented approach. The former is more fitting for data science, where it is common to structure your work in terms of pipelines and think about each processing step as a function. The latter is more suitable for application development.
For those wanting to further develop their Python skills for data science, the following books are recommended:
- Python for Data Analysis 3rd Edition by Wes McKinney, the creator of pandas.
- Data Science With Python Core Skills on Real Python provides an extensive learning path.
- Hands-On Machine Learning with Scikit-Learn, Keras and Tensorflow (2nd edition) by Aurélien Géron. You will need to purchase the book, but the notebooks with example code are freely available.
- Effective Python: 90 Specific Ways to Write Better Python (second edition).