Jake VanderPlas in the Python Data Science Handbook

Python has emerged over the last couple decades as a first-class tool for scientific computing tasks, including the analysis and visualization of large datasets. This may have come as a surprise to early proponents of the Python language: the language itself was not specifically designed with data analysis or scientific computing in mind. The usefulness of Python for data science stems primarily from the large and active ecosystem of third-party packages, particularly:

  • NumPy for manipulation of homogeneous array-based data
  • Pandas for manipulation of heterogeneous and labeled data
  • SciPy for common scientific computing tasks
  • Matplotlib for publication-quality visualizations
  • IPython for interactive execution and sharing of code
  • Scikit-Learn for machine learning
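
To make the distinction between the first two packages concrete, here is a minimal sketch, assuming NumPy and Pandas are installed. The names and values are made-up sample data, not taken from the handbook.

```python
import numpy as np
import pandas as pd

# NumPy: a homogeneous array, every element shares a single dtype.
heights_cm = np.array([172.0, 181.5, 165.2])
mean_height = heights_cm.mean()

# Pandas: labeled columns, each of which may hold a different type.
people = pd.DataFrame(
    {
        "name": ["Ada", "Grace", "Linus"],  # hypothetical sample data
        "height_cm": heights_cm,
        "is_student": [True, False, True],
    }
)

# Filter rows with a boolean column and read another column by its label.
student_names = people.loc[people["is_student"], "name"].tolist()

print(heights_cm.dtype)
print(student_names)
```

The array holds only floats, while the table mixes strings, floats and booleans and lets you address data by column name.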

How much Python should I know?

As with any other (programming) language, fluency in Python takes years to acquire, which is beyond the scope of this course. Instead, our objective is a working knowledge of Python that is sufficient to learn and apply machine learning. To make this explicit, we take the following book and online resources as our point of reference. Prior to starting the Data Science Foundation programme, participants are expected to have mastered the following topics:

  • A Whirlwind Tour of Python (page numbers refer to the PDF version):
    • Know how to install and use Python on your own computer (pages 1 to 13)
    • Know basic semantics of variables, objects and operators (pages 13 to 24)
    • Know built-in simple values and data structures (pages 24 to 37)
    • Know how to use control flow and functions (pages 37 to 45)
    • Know how to iterate and use list comprehensions (pages 52 to 61)
  • Python Data Science Handbook
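
As a rough self-check, the short sketch below touches each of the Whirlwind Tour topics listed above. It is an illustration with made-up values, not an excerpt from the book. If every line reads naturally to you, you are probably close to the expected level.

```python
import math

# Variables, objects and operators
radius = 2.5
area = math.pi * radius ** 2

# Built-in simple values and data structures: list, tuple, dict, set
temperatures = [18.5, 21.0, 19.5, 25.0]
point = (3, 4)
capitals = {"France": "Paris", "Japan": "Tokyo"}
unique_letters = set("banana")


# Control flow and functions
def classify(temp_c):
    """Return a coarse label for a temperature in degrees Celsius."""
    if temp_c < 19:
        return "cool"
    elif temp_c < 24:
        return "mild"
    return "warm"


# Iteration and list comprehensions
labels = [classify(t) for t in temperatures]
for country, city in capitals.items():
    print(f"{city} is the capital of {country}")

print(labels)  # ['cool', 'mild', 'mild', 'warm']
```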

PCEP™ – Certified Entry-Level Python Programmer

The learning path proposed here is similar to the PCEP™ – Certified Entry-Level Python Programmer certification. The PCEP™ certification is a good way to assess your current Python knowledge and to prepare for the Machine Learning Foundation course. The certification is offered by the Python Institute. You may opt to obtain this certificate.

How should I learn Python?

This foundational course aims to cater for participants both with and without programming experience. To ensure everyone is able to acquire the level described above, participants are given access to Real Python and are required to obtain a working knowledge of Python through self-study prior to starting the lectures.

If you have never done any scientific programming before, you can prepare as follows:

If you have some experience in scientific programming, for example in R or Matlab, you can jump into Python as follows:

Which Python environment should I use?

The options for getting started with Python are listed below. At the very least, make sure you know how to use Deepnote (option 1 listed below), as this is the default platform used in class.

Online environment

For those new to Python, it is probably easiest to start with one of these online tools:

  • Deepnote: there is a generous free tier. If you decide to upgrade, you can collaborate and share notebooks privately.
  • Google Colab:

Once you have gained some traction, you can move on to install Python on your local machine.

Local environment

You can set up your local machine/laptop for data science and machine learning as follows:

Guidelines for using Python for data science

Using Python for data science is inherently different from using it for, say, building a website. To give you some guidance on the many different ways, or styles, of using Python, please consider the following:

  • Focus on using existing data science libraries, instead of writing your own basic functions. If you find yourself spending a lot of time reading documentation, you are on the right track.
  • Take a functional approach to programming instead of an object-oriented approach. The former is more fitting for data science, where it is common to structure your work in terms of pipelines and think about each processing step as a function. The latter is more suitable for application development.
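
The pipeline idea in the second guideline can be sketched as follows. This is a minimal illustration using only the standard library. The step names (`drop_missing`, `to_celsius`, `summarize`), the helper `run_pipeline` and the sample readings are hypothetical, chosen only to show the pattern.

```python
from functools import reduce
from statistics import mean


def drop_missing(values):
    """Remove missing readings (represented here as None)."""
    return [v for v in values if v is not None]


def to_celsius(values):
    """Convert readings from degrees Fahrenheit to degrees Celsius."""
    return [(v - 32) * 5 / 9 for v in values]


def summarize(values):
    """Reduce the cleaned readings to a small summary."""
    return {"count": len(values), "mean": round(mean(values), 1)}


def run_pipeline(data, steps):
    """Feed the output of each step into the next one, in order."""
    return reduce(lambda result, step: step(result), steps, data)


readings_f = [68.0, None, 77.0, 59.0]  # hypothetical sample data
summary = run_pipeline(readings_f, [drop_missing, to_celsius, summarize])
print(summary)  # {'count': 3, 'mean': 20.0}
```

Each step takes data in and returns new data without modifying its input, so steps can be tested in isolation, reordered, or swapped for a library function. The mean, for instance, already comes from the standard `statistics` module rather than a hand-written loop.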

For those wanting to further develop their Python skills for data science, the following books are recommended: