How can I integrate R and Python

Python vs R Statistics

Newcomers repeatedly ask me whether it is more worthwhile to start with the Python programming language than with R. There are already plenty of discussions and holy wars about this comparison on English-language sites. I deliberately did not read further into them; instead, I want to share my own experience here on the blog, and I look forward to your opinions and experiences!

Get there faster with less R code, and beyond with Python

What immediately struck me when I started with R: after installation, you can get going right away. A plot or a regression analysis takes only a few lines of code, because the language ships with these functions by default. In Python the goal is not far away either, but Matplotlib must first be installed for plots, NumPy for matrix calculations, and pandas to obtain a data structure comparable to R's data.frame. These libraries can rightly be seen as part of the Python universe, but they are not part of the standard distribution and need to be treated as distinct from core Python. In plain language: the libraries require extra learning and make things more complicated, so plain Python loses some of its simplicity.
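
To make the comparison concrete: in R, a simple linear regression is a one-liner with the built-in lm() function. The following is only a minimal sketch of the same fit in Python using nothing but the standard library. The helper name fit_line and the data values are made up for illustration, and in practice most people would use NumPy or a statistics package instead.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    # slope = sum of cross-deviations / sum of squared x-deviations
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up illustrative data lying exactly on y = 1 + 2x
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

intercept, slope = fit_line(xs, ys)
print(f"intercept={intercept:.2f}, slope={slope:.2f}")
```

Even this small example needs a hand-written helper, which illustrates why the additional libraries are hard to avoid for everyday analysis work in Python.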

The popular RStudio development environment is also unrivaled and, in my opinion, clearly superior to IPython in terms of usability. R is simply designed to analyze and visualize data, but it is also limited to that.

“R is more about sketching, and not building,” says Michael Driscoll, CEO of Metamarkets. “You won’t find R at the core of Google’s page rank or Facebook’s friend suggestion algorithms. Engineers will prototype in R, then hand off the model to be written in Java or Python.”

Python, in contrast, is a programming language that is not tied to a single purpose. (Web) server and desktop applications, and therefore complete analytical applications, can be developed entirely in Python without any break in technology. And even though R also comes with an almost unmanageable number of packages, Python offers a good deal more, for example for three-dimensional plotting.

Software developers love Python, mathematicians prefer R

Data science is an extremely interdisciplinary field: data scientists can be mathematicians, physicists, computer scientists, engineers or (albeit less often) economists or humanities scholars. A large share comes from mathematics or from heavily mathematical subjects such as physics or electrical engineering. These degree programs predominantly use programming languages that were developed by mathematicians for mathematicians, i.e. R, MATLAB or Octave. My wife, for example, studied electrical engineering and implemented all of her machine learning prototypes in MATLAB, but she also finds her way around R well.

Those who come from software development will probably find their way around Python much faster than around R. In my subjective perception, data scientists who came to the field from mathematics mostly prefer R, while those who come from application development tend to work with Python.

Python collaborates better

A data scientist rarely works alone, because data science is teamwork. And where teams pursue a common goal, the working environment has to meet special requirements. Python is a syntactically easy-to-understand programming language that is sometimes even called “executable pseudocode” (which is a slight exaggeration). It is therefore relatively easy for all team members to learn. Not every team member has to favor Python, because local prototypes can be built in R, Octave or whatever else and then integrated into Python fairly easily. For really fast applications, Python and R are too slow as interpreted languages anyway; such applications ultimately have to be implemented in C/C++. Even then Python offers advantages that should not be underestimated: its success in scientific computing rests on how easily it integrates source code written in C, C++ and Fortran.
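
As a small illustration of this C integration, the following sketch calls the C standard library function strlen directly from Python through the standard ctypes module. It assumes a Unix-like system (Linux or macOS) where the C library can be located; the input string is arbitrary. It is meant only to show the principle, not how large numerical libraries are wrapped in practice.

```python
import ctypes
import ctypes.util

# Load the C standard library. If find_library returns None, CDLL(None)
# falls back to the symbols already loaded into the running process
# (assumption: a POSIX system; this does not work on Windows).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s)
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# Call the compiled C function with a Python bytes object
length = libc.strlen(b"data science")
print(length)
```

Declaring argtypes and restype up front lets ctypes check and convert the arguments instead of guessing, which avoids subtle errors when crossing the language boundary.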

New players on the field: Scala and Julia

Unfortunately, I cannot (yet) say much about the two programming languages Scala and Julia. In my opinion, Scala seems to be emerging as a new alternative to Python. Scala comes from the Java universe and was conceived as a general-purpose programming language. It continues to gain acceptance in big data work: some big data analytics tools (Apache Spark, Apache Flink) are designed for Scala and are themselves based on this language. What makes Scala, a language strongly inspired by Java, so appealing is its extremely compact code. A MapReduce algorithm can be written in Scala with a fraction of the code that Java would require, as the code examples on the Spark website clearly show (see also: What is Apache Spark?):

Text Search in Python (Apache Spark)
# Assumes an existing SparkContext `sc` (e.g. from the pyspark shell)
from pyspark.sql import Row
from pyspark.sql.functions import col

textFile = sc.textFile("hdfs://...")
# Creates a DataFrame having a single column named "line"
df = textFile.map(lambda r: Row(r)).toDF(["line"])
errors = df.filter(col("line").like("%ERROR%"))
# Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
# Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

Text Search in Scala (Apache Spark)
// Assumes an existing SparkContext `sc` and the SQL implicits in scope (needed for toDF)
import org.apache.spark.sql.functions.col

val textFile = sc.textFile("hdfs://...")
// Creates a DataFrame having a single column named "line"
val df = textFile.toDF("line")
val errors = df.filter(col("line").like("%ERROR%"))
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count()
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect()

Text Search in Java (Apache Spark)
// Assumes an existing JavaSparkContext `sc` and SQLContext `sqlContext` (Spark 1.x API)
JavaRDD<String> textFile = sc.textFile("hdfs://...");
// Creates a DataFrame having a single column named "line"
JavaRDD<Row> rowRDD = textFile.map(
  new Function<String, Row>() {
    public Row call(String line) throws Exception {
      return RowFactory.create(line);
    }
  });
List<StructField> fields = new ArrayList<StructField>();
fields.add(DataTypes.createStructField("line", DataTypes.StringType, true));
StructType schema = DataTypes.createStructType(fields);
DataFrame df = sqlContext.createDataFrame(rowRDD, schema);
DataFrame errors = df.filter(col("line").like("%ERROR%"));
// Counts errors mentioning MySQL
errors.filter(col("line").like("%MySQL%")).count();
// Fetches the MySQL errors as an array of strings
errors.filter(col("line").like("%MySQL%")).collect();
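
All three Spark listings express the same two-step filter. To see what they compute without a cluster, here is a rough local sketch in plain Python. The log lines are hypothetical stand-ins for the HDFS file, and the substring test only approximates the SQL pattern `%...%` used by like().

```python
# Hypothetical log lines standing in for the HDFS text file
lines = [
    "INFO  server started",
    "ERROR MySQL connection refused",
    "ERROR disk quota exceeded",
    "WARN  slow query on MySQL replica",
    "ERROR MySQL deadlock detected",
]

# Counterpart of df.filter(col("line").like("%ERROR%"))
errors = [line for line in lines if "ERROR" in line]

# Counterpart of errors.filter(col("line").like("%MySQL%"))
mysql_errors = [line for line in errors if "MySQL" in line]

print(len(mysql_errors))  # corresponds to count()
print(mysql_errors)       # corresponds to collect()
```

The difference is that Spark distributes the same filtering across many machines and only evaluates it when count() or collect() is called.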

Julia was developed (similar to R) explicitly for statistical data analysis, but it is hardly used in production because of its current beta status. Since Julia is geared towards very fast applications, it offers new hope for those who find the interpreted languages R and Python too slow.

Book recommendations for getting started with R or Python

It goes without saying that I own all the books myself and have read more than just the foreword ...

What is your experience? Your opinion counts!

Just leave your opinion as a comment on this article! Anyone who thinks they can put this comparison on digital paper in a more logical, more “correct” and more comprehensible way is welcome to send an article proposal to [email protected]!

Benjamin Aunkofer

Benjamin Aunkofer is lead data scientist at DATANOMIQ and university lecturer for data science and data strategy. In addition, he works as Interim Head of Business Intelligence and gives seminars / workshops on BI, data science and machine learning for companies.
