Advanced Compilers in
Independent Study – Fall 2017
Computer and Information Technology
The global aim of the project is to develop a
realistic machine learning pipeline optimizing data loading and the preparation
of data from various sources in addition to the actual computation. Through the
course of the independent study, the objective was to understand the Scala
programming language and explore concepts of TensorFlow, Flare and the working
of TensorFlow and Flare in conjunction (‘TensorFlare’). The report is divided
into three sections – the first explains and demonstrates the working of TensorFlow,
the second, Flare and the third explains and demonstrates the performance
benefits of chaining an accelerator like Flare to TensorFlow.
Figure 1: Dataflow graphs
TensorFlow is an open source software library
developed by the Google Brain team for heavy numerical computation with its
main applications in the areas of Machine Learning and Deep Neural Networks.
TensorFlow provides a Python and C++ API, but the Python interface is more
complete and easier to use. TensorFlow enables faster computation when compared
to other deep learning libraries while also supporting CPUs, GPUs and
distributed processing in a cluster.
TensorFlow structure is based on the execution of a dataflow graph. The
dataflow graph consists of two basic units: nodes and edges. Nodes represent
operators and the edges represent multidimensional array (Tensors) that flow
between the nodes.
Computations in TensorFlow are structured as
stateful dataflow graphs that provide flexibility for training and executing
models for research purposes while also being robust enough for deployment in
production training 1. As an example to demonstrate TensorFlow, the MNIST
program is executed. The MNIST database (Modified National Institute of
Standards and Technology database) is a large database containing handwritten
digits used for training different image processing systems 3. The objective is to determine what the digit is by processing the image. Figure
2 shows a sample of the MNIST dataset.
Figure 2: Images in MNIST dataset
As a first step to building the
model, two placeholder operations are created as shown in the code given below.
Inputs are given to nodes through variables or placeholders. A placeholder is a
variable that is never initialized and data is assigned to it at a later time
during training. The image is first flattened and represented using a 28*28
matrix. Flattening an image is converting a 2D array to a 1D array by
unstacking the rows and lining them up. The input images x is represented as a
2D tensor of numbers, where the 784 (28*28) represents the dimensionality of a
single flattened MNIST image. As seen in Figure 3, we initialize the
placeholder that holds the training images with the parameters none, 28, 28, 1.
The ‘none’ parameter indicates the number of training images which will be
filled during the training session, 28*28 refers to the pixels in the images
and 1 is used for color representation and is 1 in our case as the images are
grayscale. The output y (a tensor as well) is a 10 dimensional vector showing
which digit the corresponding MNIST image belongs to.
The weights and the biases of the
model are defined next. The weights are the probabilities that affect how data
flows in the graph – these are updated continuously during training to move the
result closer and closer to the real solution. The bias lets us shift the
regression line to better fit the data.
In TensorFlow, the computation will
only run after the creation of a Session. To execute the dataflow graph, a
Session is initialized. Starting from an empty initial graph when the session
is created, the interface supports an Extend method that augments the empty
graph with new edges and nodes to carry out the computation. 1
Figure 3: MNIST
Program – Code snippet
To begin the model training process,
a loop is set up and we train the model 100 images at a time. Once the training
is completed, the model is tested by comparing the model data to the output
values and the accuracy is calculated and printed out. The accuracy increases
with iterations and eventually satisfactorily predicts the number in image.
The training output of the model is
as shown in 4.
Figure 4: MNIST
Program – Training Output
Figure 5 shows the value of our
training accuracy (percentage of correctly recognized digits), which we observe
increases with time.
Figure 5: MNIST
Program Model Accuracy
loss (Figure 6) is a parameter used to evaluate the performance of a
classification model, the loss function is used to drive the training process
by presenting a variable that increases as the predicted probability shifts
away from the actual label and our model tries to minimize it. Figures 7 and 8
represent the spread of the model weights and biases.
Figure 6: MNIST
Program – Cross Entropy Loss
Figure 7: Spread
of weights with time
Figure 8: Spread
of biases with time
Spark supports SQL front-end and query optimization techniques
with a promise of achieving added expressiveness and performance. Spark uses
Catalyst that performs rule based and cost based optimizations, and Tungsten
that improves performance by adopting memory management techniques, using cache
aware data structures and Java code generation that is compiled to JVM at
runtime 34. However, on investigation, it becomes quite apparent that Spark
performs comparitively worse than query engines such as Hyper which could be
attributed to the fact that Spark uses Java based techniques.
Flare is backend for Spark that improves the query execution
without modifying Spark’s front-end hence retaining its expressiveness. Flare
achieves the improved performance by compilation to native code, replacing
parts of Spark’s runtime system and extending the optimization to large classes
of UDFs 5.
Flare Level 1: At
level 1, Flare achieves better performance by adding rules to the Catalyst
Optimization Logic which causes Tungsten to invoke Flare to generate C code for
supported operators, executed by Spark’s runtime 5. Since no modifications
are made to the execution model and memory management is not performed, at
level 1, Flare is lightweight and can enable better performance of some, if not
Flare Level 2: At
level 2, the operator nodes in the query plan tree are mapped to Flare’s
internal optimized data structures and query execution logic which are
represented in LMS. Lightweight Modular Staging is a runtime code generation
approach which performs optimizations and distinguishes expressions that are to
be evaluated ‘now’ and those that can be evaluated at a later time. Further, at
this level, Flare implements compiled CSV and Parquet which helps generate
native code that is tailored to a given schema 5.
Flare Level 3: Flare
Level 3 handles heterogeneous workloads consisting of user defined code and
different Domain Specific Languages (DSL) using Delite. Delite, which is built
on top of LMS, is a compiler framework and runtime for embedded DSLs. Flare
exports a Spark SQL optimized query and maps it to Delite’s OptiQL.
Figure 11: TPC-H Query 4
To demonstrate Flare’s performance, we can execute the TPC-H benchmark
which provides us with a set of ad-hoc queries. In this example, data with a
scale factor of 1 is generated (1GB in size). The queries executed and their
results are shown below.
TPC-H Query 19:
Figure 13: TPC-H Query 5
Figure 10: TPC-H Query 19 Output
Figure 12: TPC-H Query 4 Output
Figure 9: TPC-H Query 19
Figure 14: TPC-H Query 5 Output
TPC-H Query 4:
TPC-H Query 5:
Figure 17: TensorFlare – Crime query
Figure 16: TensorFlare – Code Snippet
Figure 15: TensorFlare – Training session output
Section 2 and 3 have described the working of Tensorflow and Flare
independently, but using them in conjunction can help us achieve accelerated
performance in machine learning pipelines. To demonstrate the working of
TensorFlare, we perform simple classification on US Crime data. The different
classes that the data can belong to are murder, assault, population and rape.
To train our model, we use an approach similar to the MNIST example discussed earlier.
Running the code on the crime data, we print the accuracy of our training
session as well as the weight and the bias. A sample output of the training
session is shown in Figure 15.
high accuracy is obtained as our data consists of only about 2000 records. In
this implementation, the schema is manually defined in Flare for the crime
statistics dataset. The parameters obtained from the previous training session
are assigned as the weights and bias for the classification. Classifier, a UDF
in Spark which is created within Flare is where the TensorFlow session is
actually established and operations are carried out. To achieve speed gains,
within the Flare implementation, we use an object file that is created using
tfcompile which allows us to compile TensorFlow’s operations in addition to Flare’s
beforehand (shown in Figure 16). The query that we run is shown below in Figure
17. After running the query in Flare, we achieve the classification result in
about 1.8 seconds with an accuracy of 100%.
Figure 18: Classifier output – Accuracy Matrix
Figure 18 represents the results of the classification as an accuracy
matrix where the entries in the diagonal represent the correctly classified
records. A zero entry in all positions except the diagonal indicates the best
This independent study served as an introduction to research
carried out in the field of compilers and its applications in machine learning.
It has helped gain experience in Scala programming and understanding concepts
related to generative programming. Further, the performance implications of
utilizing an accelerator such as Flare that facilitates speed gains by
compiling queries to native code and replacing part of Spark’s runtime system
were observed by running sample TPCH queries. The applications of TensorFlow
and its working were explored and the speedups achieved by chaining Flare with
TensorFlow were examined by running a simple crime classification problem.
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng
Chen, Craig Citro, Greg S. Corrado, Andy Davis,Jeffrey Dean, Matthieu Devin,
Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard,
Rafal Jozefowicz, Yangqing Jia, Lukasz Kaiser, Manjunath Kudlur, Josh
Levenberg, Dan Mané, Mike Schuster, Rajat Monga, Sherry Moore, Derek Murray,
Chris Olah, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar,
Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol
Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang
Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems,
2015, from http://download.tensorflow.org/paper/whitepaper2015.pdf .
“Support vector machines speed pattern recognition – Vision
Systems Design”. Vision Systems Design. Retrieved 17 August 2013.
J. Laskowski. Mastering
Apache Spark 2.
M. Armbrust, R. S. Xin, C.
Lian, Y. Huai, D. Liu, J. K. Bradley, X. Meng, T. Kaftan, M. J. Franklin, A.
Ghodsi, and M. Zaharia. Spark SQL: relational data processing in spark. In
SIGMOD, pages 1383–1394. ACM, 2015.
Essertel, Ruby Y Tahboub, James M Decker, Kevin J Brown, Kunle Olukotun,
Tiark Rompf. Flare: Native Compilation for Heterogenous Workloads in Apache
Spark, 23 March 2017.