What is Julia? A fresh approach to numerical computing

Julia is a free, open source, high-level, high-performance, dynamic programming language for numerical computing. It offers the development convenience of a dynamic language with the performance of a compiled, statically typed language, thanks in part to a JIT compiler based on LLVM that generates native machine code, and in part to a design that achieves type stability through specialization via multiple dispatch, which makes Julia easy to compile to efficient code.

In the blog post announcing the initial release of Julia in 2012, the authors of the language—Jeff Bezanson, Stefan Karpinski, Viral Shah, and Alan Edelman—stated that they spent three years creating Julia because they were greedy. They were tired of the trade-offs among Matlab, Lisp, Python, Ruby, Perl, Mathematica, R, and C, and wanted a single language that would be good for scientific computing, machine learning, data mining, large-scale linear algebra, parallel computing, and distributed computing.

Who is Julia for? In addition to being attractive to research scientists and engineers, Julia is also attractive to data scientists and to financial analysts and quants.

The designers of the language and two others founded Julia Computing in July 2015 to “develop products that make Julia easy to use, easy to deploy, and easy to scale.” As of this writing, the company has a staff of 28 and customers ranging from national labs to banks to economists to autonomous vehicle researchers. In addition to maintaining the Julia open source repositories on GitHub, Julia Computing offers commercial products, including JuliaPro, which comes in both free and paid versions.

Why Julia?

Julia “aims to create an unprecedented combination of ease-of-use, power, and efficiency in a single language.” To the issue of efficiency, consider the graph below:

Julia performance relative to C (image: Julia Computing).

The figure above shows performance relative to C for Julia and 10 other languages. Lower is better. The benchmarks shown are very low-level tasks. The graph was created using the Gadfly plotting and data visualization system in a Jupyter notebook. The languages to the right of Julia are ordered by the geometric mean of the benchmark results, with LuaJIT the fastest and GNU Octave the slowest.

Julia benchmarks

What we’re seeing here is that Julia code can be faster than C for a few kinds of operations, and no more than a few times slower than C for others. Compare that to, say, R, which can be almost 1,000 times slower than C for some operations.

Note that one of the slowest tests for Julia is Fibonacci recursion; that is because Julia currently lacks tail recursion optimization. Recursion is inherently slower than looping. For real Julia programs that you want to run in production, you’ll want to implement the loop (iteration) form of such algorithms.

Julia JIT compilation

There is a cost to the JIT (just-in-time) compiler approach as opposed to a pure interpreter: The compiler has to parse the source code and generate machine code before your code can run. That can mean a noticeable start-up time for Julia programs the first time each function and macro runs in a session. So, in the listing below, we see that the second time we generate a million random floating point numbers, the time taken is two orders of magnitude less than on the first execution. Both the @time macro and the rand() function needed to be compiled the first time through the code, because the Julia libraries are written in Julia.

julia> @time rand(10^6);
0.62081 seconds (14.44 k allocations: 8.415 MiB)

julia> @time rand(10^6);
0.004881 seconds (7 allocations: 7.630 MiB)

Julia fans claim, variously, that it has the ease of use of Python, R, or even Matlab. These comparisons do bear scrutiny, as the Julia language is elegant, powerful, and oriented towards scientific computing, and the libraries supply a broad range of advanced programming functionality.

Julia example

As a quick Julia language example, consider the following Mandelbrot set benchmark code:


Mandelbrot set benchmark in Julia. 

As you can see, complex number arithmetic is built into the language, as are macros for tests and timing. As you can also see, the trailing semicolons that plague C-like languages, and the nested parentheses that plague Lisp-like languages, are absent from Julia. Note that mandelperf() is called twice, in lines 61 and 62. The first call tests the result for correctness and does the JIT-compilation; the second call gets the timing.
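The exact benchmark code is in the screenshot, but the core of the calculation can be sketched in a few lines. This is a simplified illustration of my own, not the benchmark itself; mandel and maxiter are names chosen here for clarity:

```julia
# Simplified Mandelbrot escape-time iteration; complex arithmetic
# (z^2 + c on Complex values) is built into the language.
function mandel(z::Complex)
    c = z
    maxiter = 80
    for n in 1:maxiter
        abs(z) > 2 && return n - 1   # escaped: return the iteration count
        z = z^2 + c
    end
    return maxiter                   # assumed to be in the set
end

mandel(0.0 + 0.0im)   # 80: the origin never escapes
```

Points inside the set exhaust all iterations; points outside escape quickly and return a small count.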

Julia programming

Julia has many other features worth mentioning. For one, user-defined types are as fast and compact as built-ins. In fact, you can declare abstract types that behave like generic types, except that they are compiled for the argument types that they are passed.
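As a brief sketch of that idea (Point and norm2 are hypothetical names invented for illustration), a parametric composite type is stored as compactly as a built-in once its type parameter is concrete:

```julia
# A user-defined parametric type; Point{Float64} is stored as
# compactly as a pair of machine doubles.
struct Point{T<:Real}
    x::T
    y::T
end

# Julia compiles a specialized method of norm2 for each concrete
# argument type the function is actually called with.
norm2(p::Point) = p.x^2 + p.y^2

norm2(Point(3.0, 4.0))   # 25.0
```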

For another, Julia’s built-in code vectorization means that there is no need for a programmer to vectorize code for performance; ordinary devectorized code is fast. The compiler can take advantage of SIMD instructions and registers if they are present on the underlying CPU, unrolling sequential loops to vectorize them as much as the hardware allows. You can mark loops as vectorizable with the @simd annotation.
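A minimal sketch of that annotation (mysum is a hypothetical name):

```julia
# The @simd annotation asserts that loop iterations are independent
# and may be reordered, letting the compiler emit SIMD instructions.
function mysum(a::Vector{Float64})
    s = 0.0
    @simd for i in eachindex(a)
        @inbounds s += a[i]   # @inbounds also skips bounds checks
    end
    return s
end

mysum([1.0, 2.0, 3.0])   # 6.0
```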

Julia parallelism

Julia was also designed for parallelism and distributed computation, using two primitives: remote references and remote calls. Remote references come in two flavors: Future and RemoteChannel. A Future is the equivalent of a JavaScript promise; a RemoteChannel is rewritable and can be used for inter-process communication, like a Unix pipe or a Go channel. Assuming that you have started Julia with multiple processes (e.g. julia -p 8 for an eight-core CPU such as an Intel Core i7), you can use @spawn or remotecall() to execute function calls on another Julia process asynchronously, and later fetch() the returned Future when you want to synchronize and use the result.
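A sketch of the pattern, assuming a recent Julia where these primitives live in the Distributed standard library (in older versions they are in Base, and @spawnat :any is spelled @spawn):

```julia
using Distributed
addprocs(2)                  # or start Julia with `julia -p 2`

# Schedule a call on any available worker; a Future returns immediately.
f = @spawnat :any sum(rand(10^6))
total = fetch(f)             # block until the result is ready

# remotecall targets a specific worker by id.
r = remotecall(+, workers()[1], 1, 2)
fetch(r)                     # 3
```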

If you don’t need to run on multiple cores, you can utilize lightweight “green” threading, called a Task() in Julia and a coroutine in some other languages. A Task() or @task works in conjunction with a Channel, which is the single-process version of RemoteChannel.
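A minimal single-process sketch (producer is a hypothetical name): binding a Channel to a function spawns that function as a Task, and the channel closes automatically when the task finishes.

```julia
# The producer runs as a cooperative Task, blocking on put! until
# a consumer is ready to take each value.
function producer(ch::Channel)
    for i in 1:5
        put!(ch, i^2)
    end
end

ch = Channel(producer)   # spawns producer as a Task bound to ch
collect(ch)              # [1, 4, 9, 16, 25]
```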

Julia type system

Julia has an unobtrusive yet powerful type system that is dynamic with run-time type inference by default, but allows for optional type annotations. This is similar to TypeScript. For example:

julia> (1+2)::AbstractFloat
ERROR: TypeError: typeassert: expected AbstractFloat, got Int64
julia> (1+2)::Int
3

Here we are asserting an incompatible type the first time, causing an error, and a compatible type the second time.
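Annotations can also appear on function arguments and return values. This hypothetical function illustrates the syntax; omitting the annotations would leave it fully generic:

```julia
# Argument and return-type annotations; Julia converts the result
# to the declared return type, or raises an error if it can't.
function hypot_len(x::Float64, y::Float64)::Float64
    return sqrt(x^2 + y^2)
end

hypot_len(3.0, 4.0)   # 5.0
```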

Julia strings

Julia has efficient support for Unicode strings and characters, stored in UTF-8 format, as well as efficient support for ASCII characters, since in UTF-8 the code points below 0x80 (128) are encoded in a single byte. Otherwise, UTF-8 is a variable-length encoding, so you can’t assume that the character length of a Julia string is equal to its last byte index.

Full support for UTF-8 means, among other things, that you can easily define variables using Greek letters, which can make scientific Julia code look very much like the textbook explanations of the formulas, e.g. sin(2π). A transcode() function is provided to convert UTF-8 to and from other Unicode encodings.
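A small illustration of both points:

```julia
# Greek letters are ordinary identifiers (typed as \theta<TAB>, etc., in the REPL).
θ = π / 2
sin(θ)          # 1.0

# Strings are UTF-8, so character count and byte count can differ.
s = "αβγ"
length(s)       # 3 characters
sizeof(s)       # 6 bytes
```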

C and Fortran functions

Julia can call C and Fortran functions directly, with no wrappers or special APIs needed, although you do need to know the “decorated” function name emitted by the Fortran compiler. The external C or Fortran function must be in a shared library; you use the Julia ccall() function for the actual call out. For example, on a Unix-like system you can use this Julia code to get an environment variable’s value using the getenv function in libc:

function getenv(var::AbstractString)
    val = ccall((:getenv, "libc"),
                Cstring, (Cstring,), var)
    if val == C_NULL
        error("getenv: undefined variable: ", var)
    end
    unsafe_string(val)
end

julia> getenv("SHELL")
"/bin/bash"

Julia macros

Julia has Lisp-like macros, as distinguished from the macro preprocessors used by C and C++. Julia also has other meta-programming facilities, such as reflection, code generation, symbol (e.g. :foo) and expression (e.g. :(a+b*c+1) ) objects, eval(), and generated functions. Julia macros are evaluated at parsing time.
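As a toy sketch (the @logged macro is invented for illustration), a macro receives its argument as an unevaluated expression at parse time and returns new code to splice into the program:

```julia
# A toy macro: prints the source expression, then evaluates it
# in the caller's scope (esc prevents hygienic renaming).
macro logged(ex)
    quote
        println("evaluating: ", $(string(ex)))
        $(esc(ex))
    end
end

@logged 2 + 3   # prints "evaluating: 2 + 3" and returns 5
```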

Generated functions, on the other hand, are expanded when the types of their parameters are known, prior to function compilation. Generated functions have the flexibility of generic functions (as implemented in C++ and Java) and the efficiency of strongly typed functions, by eliminating the need for run-time dispatch to support parametric polymorphism.
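A minimal sketch (tuple_length is a hypothetical example): inside an @generated function the argument names are bound to the argument types, and the body returns an expression that becomes the compiled method for those types.

```julia
# The body runs once per concrete argument type; here the tuple's
# length is computed at expansion time and baked into the method.
@generated function tuple_length(t::Tuple)
    return :( $(length(t.parameters)) )
end

tuple_length((1, 2.0, "three"))   # 3
```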

GPU support

Julia has GPU support using, among others, the MXNet deep learning package, the ArrayFire GPU array library, the cuBLAS and cuDNN linear algebra and deep neural network libraries, and the CUDA framework for general purpose GPU computing. The Julia wrappers and their respective libraries are shown in the diagram below.


You can draw on a number of Julia packages to program GPUs at different abstraction levels. 

JuliaPro and Juno IDE

You can download the free open source Julia command line for Windows, MacOS, generic Linux, or generic FreeBSD from the Julia language site. You can clone the Julia source code repository from GitHub.

Alternatively you can download JuliaPro from Julia Computing. In addition to the compiler, JuliaPro gives you the Atom-based Juno IDE (shown below) and more than 160 curated packages, including visualization and plotting.

Beyond what’s in the free JuliaPro, you can add subscriptions for enterprise support, quantitative finance functionality, database support, and time series analysis. JuliaRun is a scalable server for a cluster or cloud.


Juno is a free Julia IDE based on the Atom text editor. 

Jupyter notebooks and IJulia

In addition to using Juno as your Julia IDE, you can use Visual Studio Code with the Julia extension (shown directly below), and Jupyter notebooks with the IJulia kernel (shown in the second and third screenshots below). You may need to install Jupyter notebooks for Python 2 or (preferably) Python 3 with Anaconda or pip.


Visual Studio Code with the Julia extension. 


Launching a Julia kernel from Jupyter notebook.


Plotting a sine wave using Julia in a Jupyter notebook.

JuliaBox

You can run Julia in Jupyter notebooks online using JuliaBox (shown below), another product of Julia Computing, without doing any installation on your local machine. JuliaBox currently includes more than 300 packages, runs Julia 0.6.2, and contains dozens of tutorial Jupyter notebooks. The top-level list of tutorial folders is shown below. The free level of JuliaBox access gives you 90-minute sessions with three CPU cores; the $14 per month personal subscription gives you four-hour sessions with five cores; and the $70 per month pro subscription gives you eight-hour sessions with 32 cores. GPU access is not yet available as of June 2018.


JuliaBox runs Julia in Jupyter notebooks online. 

Julia packages

Julia “walks like Python, but runs like C.” As my colleague Serdar Yegulalp wrote in December 2017, Julia is starting to challenge Python for data science programming, and both languages have advantages. As an indication of the rapidly maturing support for data science in Julia, consider that there are already two books entitled Julia for Data Science, one by Zacharias Voulgaris, and the other by Anshul Joshi, although I can’t speak to the quality of either one.

If you look at the overall highest-rated Julia packages from Julia Observer, shown below, you’ll see a Julia kernel for Jupyter notebooks, the Gadfly graphics package (similar to ggplot2 in R), a generic plotting interface, several deep learning and machine learning packages, differential equation solvers, DataFrames, New York Fed dynamic stochastic general equilibrium (DSGE) models, an optimization modeling language, and interfaces to Python and C++. If you go a little farther down this general list, you will also find QuantEcon, PyPlot, ScikitLearn, a bioinformatics package, and an implementation of lazy lists for functional programming.


Julia’s top packages. 

If the Julia packages don’t suffice for your needs, and the Python interface doesn’t get you where you want to go, you can also install a package that gives you generic interfaces to R (RCall) and Matlab.

Julia for financial analysts and quants

Quants and financial analysts will find many free packages to speed their work, as shown in the screenshot below. In addition, Julia Computing offers the JuliaFin suite, consisting of Miletus (a DSL for financial contracts), JuliaDB (a high performance in-memory and distributed database), JuliaInXL (call Julia from Excel sheets), and Bloomberg connectivity (access to real-time and historical market data).


Julia’s top finance packages. 


Julia for researchers

Researchers will find many packages of interest, as you can see from the category names in the right-hand column above. In addition, many of the base features of the Julia language are oriented towards science, engineering, and analysis. For example, as you can see in the screenshot below, matrices and linear algebra are built into the language at a sophisticated level.


Julia offers sophisticated support for multi-dimensional arrays and linear algebra operations. 

Learn Julia

As you’ve seen, you can use Julia and many packages for free, and buy enterprise support and advanced features if you need them. There are a few gotchas to consider as you’re starting to evaluate Julia.

First, you need to know that ordinary global variables make Julia slow. That’s because variables at global scope don’t have a fixed type unless you’ve declared one, which in turn means that functions and expressions using the global variable have to handle any type. It’s much more efficient to declare variables inside the scope of functions, so that their type can be determined and the simplest possible code to use them can be generated.

Second, you need to know that variables declared at top level in the Julia command line are global. If you can’t avoid doing that, you can make performance a little better (or less awful) by declaring them const. That doesn’t mean that the value of the variable can’t change—it can. It means that the type of the variable can’t change.
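A short illustration of the difference, with hypothetical names:

```julia
# A const global has a fixed type, so functions that read it
# compile to tight code; its value can still be reassigned
# (with a warning), but its type cannot change.
const scale = 2.5

times_scale(x) = x * scale
times_scale(4.0)   # 10.0
```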

Finally, read the Julia manual and the official list of Julia learning resources. In particular, read the getting started section of the manual and watch Jane Herriman’s introductory tutorial and any other videos in the learning resources that strike you as relevant. If you would prefer to follow along on your own machine rather than on JuliaBox, you may want to clone the JuliaBoxTutorials repo from GitHub and run the Local_installations notebook from Jupyter to install all the packages needed.

Source: InfoWorld Big Data

Review: Amazon SageMaker scales deep learning

Amazon SageMaker, a machine learning development and deployment service introduced at re:Invent 2017, cleverly sidesteps the eternal debate about the “best” machine learning and deep learning frameworks by supporting all of them at some level. While AWS has publicly supported Apache MXNet, its business is selling you cloud services, not telling you how to do your job.

SageMaker, as shown in the screenshot below, lets you create Jupyter notebook VM instances in which you can write code and run it interactively, initially for cleaning and transforming (feature engineering) your data. Once the data is prepared, notebook code can spawn training jobs in other instances, and create trained models that can be used for prediction. SageMaker also sidesteps the need to have massive GPU resources constantly attached to your development notebook environment by letting you specify the number and type of VM instances needed for each training and inference job.

Trained models can be attached to endpoints that can be called as services. SageMaker relies on an S3 bucket (that you need to provide) for permanent storage, while notebook instances have their own temporary storage.

SageMaker provides 11 customized algorithms that you can train against your data. The documentation for each algorithm explains the recommended input format, whether it supports GPUs, and whether it supports distributed training. These algorithms cover many supervised and unsupervised learning use cases and reflect recent research, but you aren’t limited to the algorithms that Amazon provides. You can also use custom TensorFlow or Apache MXNet Python code, both of which are pre-loaded into the notebook, or supply a Docker image that contains your own code written in essentially any language using any framework. A hyperparameter optimization layer is available as a preview for a limited number of beta testers.

Source: InfoWorld Big Data

Technology of the Year 2018: The best hardware, software, and cloud services

Was 2017 the year that every product under the sun was marketed as being cognitive, having machine learning, or being artificially intelligent? Well, yes. But don’t hate all of them. In many cases, machine learning actually did improve the functionality of products, sometimes in surprising ways.

Our reviewers didn’t give any prizes for incorporating AI, but did pick out the most prominent tools for building and training models. These include the deep learning frameworks Tensor­Flow and PyTorch, the automated model-building package H2O.ai Driverless AI, and the solid machine learning toolbox Scikit-learn.

The MLlib portion of Apache Spark fits into this group as well, as does the 25-year-old(!) R programming language, of which our reviewer says, “No matter what the machine learning problem, there is likely a solution in CRAN, the comprehensive repository for R code, and in all likelihood it was written by an expert in the domain.”

2017 was also the year when you could pick a database without making huge compromises. Do you need SQL, geographic distribution, horizontal scalability, and strong consistency? Both Google Cloud Spanner and CockroachDB have all of that. Do you need a distributed NoSQL database with a choice of APIs and consistency models? That would be Microsoft’s Azure Cosmos DB.

Are you serving data from multiple endpoints? You’ll probably want to use GraphQL to query them, and you might use Apollo Server as a driver if your client is a Node.js application. Taking a more graph-oriented view of data, a GraphQL query looks something like a JSON structure with the data left out.

As for graph database servers, consider Neo4j, which offers highly available clusters, ACID transactions, and causal consistency. Are you looking for an in-memory GPU-based SQL database that can update geospatial displays of billions of locations in milliseconds? MapD is what you need.

Two up-and-coming programming languages made the cut, for completely different domains. Kotlin looks like a streamlined version of object-oriented Java, but it is also a full-blown functional programming language, and most importantly eliminates the danger of null pointer references and eases the handling of null values. Rust, on the other hand, offers memory safety in an alternative to C and C++ that is designed for bare-metal and systems-level programming.

Speaking of safety, we also salute two security products—one for making it easier for developers to build secure applications, the other for extending security defenses to modern application environments. GitHub security alerts notify you when GitHub detects a vulnerability in one of your GitHub project dependencies, and suggest known fixes from the GitHub community. Signal Sciences protects against threats to your web applications and APIs. 

If you’ve started deploying Docker containers, sooner or later you’re going to want to orchestrate and manage clusters of them. For that, you’ll most likely want Kubernetes, either by itself, or as a service in the AWS, Azure, or Google clouds. Honeycomb goes beyond monitoring and logging to give your distributed systems observability.

Recently, the heavyweight Angular and React frameworks have dominated the discussion of JavaScript web applications. There’s a simpler framework that is gaining mindshare, however: Vue.js. Vue.js still builds a virtual DOM, but it doesn’t make you learn non-standard syntax or install a specialized tool chain just to deploy a site.

Microsoft’s relationship with Linux has been troubled over the years, to say the least. For example, in 2001 Steve Ballmer called Linux a “cancer.” The need for Linux in the Azure cloud changed all that, and the Windows Subsystem for Linux allows you to run a for-real Ubuntu or Suse Bash shell in Windows 10, allowing you to install and run legitimate Linux binary apps from the standard repositories, including the Azure Bash command line.

Read about all of these winning products, and many more, in our tour of 2018 Technology of the Year Award winners.

Source: InfoWorld Big Data

TensorFlow review: The best deep learning library gets better

If you looked at TensorFlow as a deep learning framework last year and decided that it was too hard or too immature to use, it might be time to give it another look.

Since I reviewed TensorFlow r0.10 in October 2016, Google’s open source framework for deep learning has become more mature, implemented more algorithms and deployment options, and become easier to program. TensorFlow is now up to version r1.4.1 (stable version and web documentation), r1.5 (release candidate), and pre-release r1.6 (master branch and daily builds).

The TensorFlow project has been quite active. As a crude measure, the TensorFlow repository on GitHub currently has about 27 thousand commits, 85 thousand stars, and 42 thousand forks. These are impressive numbers reflecting high activity and interest, exceeding even the activity on the Node.js repo. A comparable framework, MXNet, which is strongly supported by Amazon, has considerably lower activity metrics: less than 7 thousand commits, about 13 thousand stars, and less than 5 thousand forks. Another statistic of note, from the TensorFlow r1.0 release in February 2017, is that people were using TensorFlow in more than 6,000 open source repositories online.

Much of the information in my TensorFlow r0.10 review and my November 2016 TensorFlow tutorial is still relevant. In this review I will concentrate on the current state of TensorFlow as of January 2018, and bring out the important features added in the last year or so.

Source: InfoWorld Big Data

Review: H2O.ai automates machine learning


Machine learning, and especially deep learning, have turned out to be incredibly useful in the right hands, as well as incredibly demanding of computer hardware. The boom in availability of high-end GPGPUs (general purpose graphics processing units), FPGAs (field-programmable gate arrays), and custom chips such as Google’s Tensor Processing Unit (TPU) isn’t an accident, nor is their appearance on cloud services.

But finding the right hands? There’s the rub—or is it? There is certainly a perceived dearth of qualified data scientists and machine learning programmers. Whether there’s a real lack or not depends on whether the typical corporate hiring process for data scientists and developers makes sense. I would argue that the hiring process is deeply flawed in most organizations.

If companies teamed up domain experts, statistics-literate analysts, SQL programmers, and machine learning programmers, rather than trying to find data scientists with Ph.D.s plus 20 years of experience who were under 39, they would be able to staff up. Further, if they made use of a tool such as H2O.ai’s Driverless AI, which automates a significant portion of the machine learning process, they could make these teams dramatically more efficient.

As we’ll see, Driverless AI is an automatically driven machine learning system that is able to create and train surprisingly good models in a surprisingly short time, without requiring data science expertise. However, while Driverless AI reduces the level of machine learning, feature engineering, and statistical expertise required, it doesn’t eliminate the need to understand your data and the statistical and machine learning algorithms you’re applying to it.  

Source: InfoWorld Big Data

R tutorial: Learn to crunch big data with R

A few years ago, I was the CTO and cofounder of a startup in the medical practice management software space. One of the problems we were trying to solve was how medical office visit schedules can optimize everyone’s time. Too often, office visits are scheduled to optimize the physician’s time, and patients have to wait way too long in overcrowded waiting rooms in the company of people coughing contagious diseases out their lungs.

One of my cofounders, a hospital medical director, had a multivariate linear model that could predict the required length for an office visit based on the reason for the visit, whether the patient needs a translator, the average historical visit lengths of both doctor and patient, and other possibly relevant factors. One of the subsystems I needed to build was a monthly regression task to update all of the coefficients in the model based on historical data. After exploring many options, I chose to implement this piece in R, taking advantage of the wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques implemented in the R system.

One of the attractions for me was the R scripting language, which makes it easy to save and rerun analyses on updated data sets; another attraction was the ability to integrate R and C++. A key benefit for this project was the fact that R, unlike Microsoft Excel and other GUI analysis programs, is completely auditable.

Alas, that startup ran out of money not long after I implemented a proof-of-concept web application, at least partially because our first hospital customer had to declare Chapter 7 bankruptcy. Nevertheless, I continue to favor R for statistical analysis and data science.

Source: InfoWorld Big Data

Bossie Awards 2017: The best databases and analytics tools

CockroachDB is a cloud-native SQL database for building global, scalable cloud services that survive disasters. Built on a transactional and strongly consistent key-value store, CockroachDB scales horizontally, survives disk, machine, rack, and even datacenter failures with minimal latency disruption and no manual intervention, supports strongly consistent ACID transactions, and provides a familiar SQL API for structuring, manipulating, and querying data. CockroachDB was inspired by Google’s Spanner and F1 technologies.

— Martin Heller

Source: InfoWorld Big Data

Bossie Awards 2017: The best machine learning tools

Core ML is Apple’s framework for integrating trained machine learning models into an iOS or MacOS app. Core ML supports Apple’s Vision framework for image analysis, Foundation framework for natural language processing, and GameplayKit framework for evaluating learned decision trees. Currently, Core ML cannot train models itself, and the only trained models available from Apple in Core ML format are for image classification. However, Core ML Tools, a Python package, can convert models from Caffe, Keras, scikit-learn, XGBoost, and LIBSVM.

— Martin Heller

Source: InfoWorld Big Data

Review: Domo is good BI, not great BI

In the last couple of years I have reviewed four of the leading business intelligence (BI) products: Tableau, Qlik Sense, Microsoft Power BI, and Amazon QuickSight. In general terms, Tableau sets the bar for ease of use, and Power BI sets the bar for low price.

Domo is an online BI tool that combines a large assortment of data connectors, an ETL system, a unified data store, a large selection of visualizations, integrated social media, and reporting. Domo claims to be more than a BI tool because its social media tool can lead to “actionable insights,” but in practice every BI tool either leads to actions that benefit the business or winds up tossed onto the rubbish heap.

Domo is a very good and capable BI system. It stands out with support for lots of data sources and lots of chart types, and the integrated social media feature is nice (if overblown). However, Domo is harder to learn and use than Tableau, Qlik Sense, and Power BI, and at $2,000 per user per year it is multiples more expensive.

Depending on your needs, Tableau, Qlik Sense, or Power BI is highly likely to be a better choice than Domo.  

Source: InfoWorld Big Data

Review: Tableau takes self-service BI to new heights

Since I reviewed Tableau, Qlik Sense, and Microsoft Power BI in 2015, Tableau and Microsoft have solidified their leadership in the business intelligence (BI) market: Tableau with intuitive interactive exploration, Microsoft with low price and Office integration. Qlik is still a leader compared to the other 20 vendors in the sector, but trails both Tableau and Power BI.

In addition to new analytics, mapping, and data connection features, Tableau has added better support of enterprises and mobile devices in the last two years. In this review, I’ll give you a snapshot of Tableau as it now stands, drill in on features new since version 9, and explore the Tableau road map.

Source: InfoWorld Big Data