Open source big data technologies (BDT) like Apache Spark or Apache Flink are widely used for the distributed processing of huge amounts of data (mostly counting words ;)).

On the other side, complex data analysis can be *"easily"* performed using open source libraries (e.g. machine learning frameworks like scikit-learn or caret) in languages like Python or R. These libraries are usually built on top of optimized libraries written in low-level languages like C or Fortran, while providing a friendlier coding style.

Unfortunately, combining both, i.e. doing distributed complex data analysis *à la* big data, is a topic not as mature as one would like. In my opinion, an Achilles' heel of BDT is the absence of optimized distributed linear algebra libraries like those in High Performance Computing (HPC) (e.g. PBLAS, ScaLAPACK, PARPACK). But that is another story and shall be told another time …

In this post, I would like to explore the possibilities that big data technologies (will) provide to perform distributed numerical simulations. Numerical simulation is a broad field, so let's consider a concrete example, the heat equation:

$$\frac{\partial T}{\partial t} = \alpha \nabla^2 T + Q \qquad (1)$$

This equation is deeply connected with many interesting problems in physics, chemistry or financial mathematics, so I consider it may be an interesting starting point.

Considering a *naïve* approach, let me use the *explicit form* to integrate the heat equation. After a bit of algebra one obtains:

$$T^{n+1}_{x,y,z} = c_1\, T^{n}_{x,y,z} + c_2 \sum_{\text{neighbours}} T^{n} + c_3\, Q_{x,y,z} \qquad (2)$$

where $c_1 = 1 - 6\,\alpha\,\Delta t / h^2$, $c_2 = \alpha\,\Delta t / h^2$ and $c_3 = \Delta t$.
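As a quick sanity check, here is a small Scala snippet computing the three coefficients. The values of `alpha`, `dt` and `h` are made up for illustration, and the formulas assume the standard explicit finite-difference scheme for Eq. (2):

```scala
// Illustrative parameters (assumptions, not from the post):
val alpha = 1.0   // thermal diffusivity
val dt    = 0.01  // time step
val h     = 0.5   // lattice spacing

val c2 = alpha * dt / (h * h)  // weight of each of the 6 neighbours
val c1 = 1.0 - 6.0 * c2        // weight of the point itself
val c3 = dt                    // weight of the heat source Q
```

Note that the weights of a point and its six neighbours sum to one, so a uniform temperature field without sources stays uniform.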

Leaving aside stability and precision problems, we would like to integrate this equation in a parallel, distributed way. If you have worked with HPC and MPI, a straightforward strategy could be:

- Split the lattice into blocks (points and boundary terms).
- Assign each block to a processor.
- Integrate the points.
- Update the boundaries.
- Iterate.

Unfortunately, using BDT we cannot do that. In this case, data is distributed *randomly* across the nodes. In principle, you don't know which node holds each data point or its neighbours. So we have to think of a parallel algorithm under this assumption. What about this?:

- Consider a configuration as pairs: (coordinate, values).
- For each pair, create copies with coordinates corresponding to its neighbours.
- Group all the pairs with the same coordinates.
- For each group, operate to obtain the updated value.
- Iterate.
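The steps above can be sketched in plain Scala collections (no Spark or Flink involved) on a 1-D lattice. The averaging rule and all names here are illustrative toys, chosen only to show the expand/group/reduce data flow:

```scala
// Step 1: a configuration as (coordinate, value) pairs
val field = Map(0 -> 0.0, 1 -> 1.0, 2 -> 0.0)

// Step 2: each point sends a copy of itself to its neighbours (and itself)
val expanded = field.toList.flatMap { case (x, v) =>
  List((x, v), (x - 1, v), (x + 1, v))
}

// Step 3: group all pairs that landed on the same coordinate
val grouped = expanded.groupBy(_._1)

// Step 4: reduce each group to an updated value (here: a plain average),
// discarding coordinates outside the original lattice
val updated = grouped.collect {
  case (x, vs) if field.contains(x) =>
    x -> vs.map(_._2).sum / vs.size
}
```

The point is that no step needs to know where a neighbour lives: grouping by coordinate brings the required values together wherever they are.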

Maybe this sounds a bit weird if you come from HPC, but it is quite a natural approach in terms of the functional programming paradigm used by BDT. In addition, there is an extra bonus: coding these functions in BDT is extremely simple.

Let's see how to do this using Scala. First of all, we define a general function for a single iteration:

```scala
import org.apache.spark.rdd.RDD

// Three case classes to organise the data
case class Coor(x: Int, y: Int, z: Int)
case class Value(T: Double, Q: Double)
case class Point(coor: Coor, value: Value)

// A single iteration: expand each point, group by coordinate, reduce
def update(field: RDD[Point],
           expand: Point => List[Point],
           reduce: (Coor, List[Point]) => Point): RDD[Point] =
  field
    .flatMap(expand)
    .groupBy(_.coor)
    .map { case (coor, values) => reduce(coor, values.toList) }
```

What is this function doing?

Let's understand the signature first: the function receives a set of **Point**s (**field**), a function (**expand**) that takes a **Point** and returns a set of **Point**s, and a function (**reduce**) that takes a **Coor** and a set of **Point**s and returns a **Point**; finally, **update** returns a set of **Point**s.

The body of the function is very succinct, and it can be related directly to steps 2-4 defined previously.

This function is quite general, so one can use it to solve other differential equations, Monte Carlo simulations, … just by defining the proper **expand** and **reduce** functions (yeah … sometimes functional programming is amazing).
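To make the generality claim concrete, here is a plain-collections analogue of **update** (a sketch with made-up 1-D types, not the Spark code above), showing that only the two plugged-in functions are problem-specific:

```scala
// Toy 1-D analogues of the post's case classes (illustrative names)
case class Coor1(x: Int)
case class Pt(coor: Coor1, value: Double)

// Same skeleton as update: expand, group by coordinate, reduce
def update1(field: List[Pt],
            expand: Pt => List[Pt],
            reduce: (Coor1, List[Pt]) => Pt): List[Pt] =
  field
    .flatMap(expand)
    .groupBy(_.coor)
    .map { case (c, vs) => reduce(c, vs) }
    .toList

// Trivial "physics": each point only sends to itself and keeps its value,
// so one iteration returns the field unchanged (up to ordering)
val f0 = List(Pt(Coor1(0), 1.0), Pt(Coor1(1), 2.0))
val f1 = update1(f0, p => List(p), (c, vs) => vs.head)
```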

In this case, we are considering the heat equation, so let's give an example of **expand** and **reduce** functions. The **expand** function receives a **Point** and returns a list of **Point**s with the coordinates corresponding to its neighbours and values according to Eq. (2). In this case the neighbours are the first neighbours in each direction, so we can write:

```scala
def expand(p: Point): List[Point] = {
  // Lowercase binders: uppercase names in Scala patterns are matched as
  // constants, so Value(T, Q) would not bind new variables here
  val Point(Coor(x, y, z), Value(t, q)) = p
  List(
    Point(Coor(x, y, z),     Value(c1 * t + c3 * q, q)),
    Point(Coor(x + 1, y, z), Value(c2 * t, 0.0)),
    Point(Coor(x, y + 1, z), Value(c2 * t, 0.0)),
    Point(Coor(x, y, z + 1), Value(c2 * t, 0.0)),
    Point(Coor(x - 1, y, z), Value(c2 * t, 0.0)),
    Point(Coor(x, y - 1, z), Value(c2 * t, 0.0)),
    Point(Coor(x, y, z - 1), Value(c2 * t, 0.0))
  ).filter(pt => !isOutBox(pt.coor))
}
```

Since we prepare the data in the **expand** function, the **reduce** function is quite simple: we can just sum the values:

```scala
def reduce(coor: Coor, values: List[Point]): Point =
  if (isBoundary(coor)) Point(coor, Boundary(coor))
  else Point(coor,
    values
      .map(_.value)
      .foldLeft(Value(0.0, 0.0)) { (acc, v) => Value(acc.T + v.T, acc.Q + v.Q) })
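The `foldLeft` summation can be checked in isolation with plain Scala (the values below are made up for illustration):

```scala
// Same Value case class as in the post
case class Value(T: Double, Q: Double)

// Pairwise sum of (T, Q) contributions arriving at one coordinate
val vs = List(Value(1.0, 0.5), Value(2.0, 0.0), Value(3.0, 0.25))
val total = vs.foldLeft(Value(0.0, 0.0)) { (acc, v) =>
  Value(acc.T + v.T, acc.Q + v.Q)
}
// total == Value(6.0, 0.75)
```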

That’s all, clean and simple …

In these links (heatbath_Spark, heatbath_Flink) you can find a complete implementation for both technologies. As you can see, they are ~90% identical.

I don't want to blur this post with efficiency comparisons between Spark and Flink, because that is an interesting topic in itself and I would like to treat it in a future post.
