The article provides a very good example in support of the argument for domain-specific languages to help you crank out domain-specific code. I like R and use it. But I also code my own methods (in Haskell these days, but that's not the main point). Before you go and build your company's code base around R, realize the following:
1) The primary users of R are academics and students comparing various methods on the various data sets accessible through R.
2) The code is not really designed with superb reliability in mind. I have debugged a contributor's Fortran code and sent in a patch that never appeared in the R code base. The bug remains. Professors often don't support code all that well. Don't blame them. Support is left as an exercise.
3) There are no assurances about the scalability of any particular routine--even if the algorithm scales in theory.
Do try R. It's good. But don't think SAS and the like will disappear. They cater to the production requirements of big companies. And don't use R as an excuse not to write your own production code.
1) With the integration of R and PostgreSQL (PL/R), this might change. For example, calculate the area of a complex spherical polygon on Earth using PL/R:
CREATE FUNCTION plr_polygon_area(
    latitude double precision[],
    longitude double precision[])
RETURNS double precision AS
$BODY$
    areaPolygon( cbind( longitude, latitude ) )
$BODY$
LANGUAGE 'plr' VOLATILE STRICT
COST 1;
Pretty powerful.
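For anyone who wants to poke at the R half of that function outside the database: areaPolygon() isn't named in the snippet above, but it matches the function of that name in the geosphere package (an assumption on my part), and you can try it directly in an R session:

```r
# areaPolygon() is assumed to come from the geosphere package:
#   install.packages("geosphere")
library(geosphere)

# Roughly a one-degree square on the equator, as (longitude, latitude) columns,
# the same cbind(longitude, latitude) shape the PL/R body builds.
lon <- c(0, 1, 1, 0)
lat <- c(0, 0, 1, 1)

# Returns the polygon's area in square metres on the ellipsoid.
areaPolygon(cbind(lon, lat))
```

The same two-column matrix is what the PL/R function hands to areaPolygon(), so results from the database and from an interactive session should agree.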
2) The learning curve for R was 30 days for me. (Still learning, but nothing feels alien anymore.) I submitted a bug to a professor. Not only did he fix the bug within a few hours, but he offered to personally send a new build for my platform. He also suggested a performance improvement for my code (by practically rewriting it) that made it 43 times faster.
The PL/R mailing list has been nothing but helpful and expedient.
3) Scalability of R functions is not too difficult to test.
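A minimal sketch of that kind of empirical scaling test, in base R — sort() here is just a stand-in for whatever routine you're vetting:

```r
# Time a routine at increasing input sizes to see how it scales in practice.
sizes <- c(1e4, 1e5, 1e6)
timings <- sapply(sizes, function(n) {
  x <- runif(n)                       # random input of size n
  system.time(sort(x))["elapsed"]     # wall-clock seconds for the routine
})
data.frame(n = sizes, seconds = timings)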
Foreign function calls are a good thing to support. R does this well and lots of R routines already call C or Fortran libraries.
Domain specific languages are good (and fun). Interacting with open source developers can be an awesome experience. I agree.
I just don't see R as a platform that can integrate tightly with other business systems. It's a user-oriented tool. If there are use cases out there that disprove this, it would be interesting to hear about them, especially as we move into the era of "big data".
I used PHP, PostgreSQL, R, and JasperReports to analyse 273 million rows of data across 8,000 weather stations spanning The Great White (soon to be Green by the looks of it) North for the last 110 years.
The trend line, shown in orange, is calculated in R using a Generalized Additive Model. There is no way I was going to (or even could) write such a complex algorithm myself. When I started the project, I was using MySQL. I migrated the database to PostgreSQL specifically so that I could use R for the analysis. I migrated the database before learning R.
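The post doesn't say which GAM implementation was used; the sketch below assumes mgcv, the package bundled with R, and fabricated data in place of the weather-station rows — just to show the shape of such a fit:

```r
# Hedged sketch: fit a smooth GAM trend over time with mgcv (bundled with R).
# The year/temp data here is simulated, standing in for real station readings.
library(mgcv)

set.seed(1)
year <- 1900:2009
temp <- 0.01 * (year - 1900) + sin(year / 8) + rnorm(length(year), sd = 0.5)

fit <- gam(temp ~ s(year))   # s() requests a smooth term over year
trend <- predict(fit)        # the fitted trend line, ready to plot
```

The whole model specification is one formula; the spline-basis construction and smoothness selection that make this "a complex algorithm" happen entirely inside gam().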
You know, I often forget to mention formula notation as a reason I love R. It is superbly natural. R's native support for data frames (collections of multidimensional observations) and formula notation makes it easy for every disparate library to converge on a common language for high-level usage. Pretty near every function shares the same first two parameters:
out <- f(formula, data, ... other options ...)
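Two concrete instances of that shared idiom, using the built-in mtcars data set:

```r
# The same formula/data interface across unrelated modelling functions.
fit_lm  <- lm(mpg ~ wt + hp, data = mtcars)        # linear regression
fit_aov <- aov(mpg ~ factor(cyl), data = mtcars)   # one-way ANOVA

coef(fit_lm)   # intercept plus one coefficient per predictor
```

Swap lm for glm, aov, or a modelling function from a contributed package, and the call shape stays the same — that's the convergence being described.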
I certainly agree that coming up the learning curve with R has greatly improved my working knowledge of statistics. However, lately I find myself becoming increasingly frustrated with the language itself as my projects grow in complexity. Of course this could be due to my lack of skill with R or programming in general... Does anyone else have this issue?
I did, quite quickly. In particular, R seems to be focused on univariate data; support for multivariate data (things like "lego plot" histograms, or any kind of histogram in more than one variable) is patchy at best.
Having said that, I must note that I am new to R, and rejecting anything so quickly makes me very nervous. I suspect that I have not even come close to plumbing the depths, so to speak, of R's capabilities.
To both of you: I’m far from a statistician, so YMMV, but after trying to work with multivariate data in both R and MATLAB, I really found using numpy/scipy to be substantially nicer. And even better, if you ever decide to do anything else in a program, like munge text or interact with internet services, a general-purpose language like Python is a big advantage.
You can check out the lattice, ggplot2, scatterplot3d and rgl libraries to see how R handles multivariate data - R is especially well suited for analyzing panel (longitudinal) data with multiple variables.
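As one small example of what those libraries offer, a conditioning plot with lattice (which ships with R) splits a scatter plot into one panel per group — here using the built-in CO2 panel data set:

```r
# One scatter panel of CO2 uptake vs. concentration per plant Type,
# using the lattice package's formula-based conditioning syntax.
library(lattice)

p <- xyplot(uptake ~ conc | Type, data = CO2,
            type = c("p", "smooth"))  # points plus a smoothed trend per panel
print(p)
```

The `| Type` term in the formula is what requests the per-group panels — the same formula notation discussed above, extended to plotting.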
Google apparently runs R on desktops (http://dataspora.com/blog/predictive-analytics-using-r/) and ports it to Python or C++ once their ideas are tested (for production code). Having said that, if your list of functions is growing, try packaging them up in a library - there are many tutorials available on the web. If your data is becoming big, try using RHIPE (Hadoop integration), bigmemory, snow, or a number of other packages:
http://cran.r-project.org/web/views/HighPerformanceComputing...
Matlab, R, and S-PLUS are all different animals. They are usually used for different things (and what they are used for can vary from person to person). For instance, I use matlab for object oriented scientific programming, R for quick and dirty (but math intensive) scripts, and S-PLUS for quick and dirty stats intensive scripts. Long story short, S-PLUS is geared towards statistics (better modules, functions, etc). If you do some searching, you'll find that R is actually an offspring of S, with S-PLUS being R's sibling. If one can program in R, picking up S-PLUS should be cake since the syntax and programming hats are similar.
I don't really like to type that much, hence my initial terse commentary--but I can certainly oblige someone who is genuinely interested :-)
S-PLUS objects reside on the hard drive whereas R's are stored entirely in memory. So R is faster for smaller computations but runs against limitations when the data sets are large, though you can use the bigmemory library or store your data in external databases - e.g., SQLite or PostgreSQL - and pull off chunks as you need them. R also has a much more extensive library, but I heard that S-PLUS (as of version 8) made their program compatible with R so that R's libraries can be used in S-PLUS. Also, R has lexical scoping; I think S-PLUS only has global and local like Matlab. I personally like lexical scoping and can't think of cases when you'd find S-PLUS's scope definition advantageous.
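The classic illustration of what lexical scoping buys you is a closure - the inner function keeps a reference to the environment it was created in, so state persists between calls without any globals:

```r
# Lexical scoping in R: the returned function closes over `count`,
# which lives on in make_counter()'s environment between calls.
make_counter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # <<- assigns in the enclosing environment
    count
  }
}

counter <- make_counter()
counter()  # 1
counter()  # 2
```

Under global/local-only scoping, there is no enclosing environment for `count` to live in, so this pattern simply can't be expressed the same way.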