Rdd Auction - The Creative Blog

An RDD is, essentially, the Spark representation of a set of data, spread across multiple machines, with APIs to let you act on it. An RDD could come from any datasource, e.g. text files, a database via JDBC, etc. The formal definition is: RDDs are fault-tolerant, parallel data structures that let users explicitly persist intermediate results in memory, control their partitioning to optimize ...

I'm just wondering what is the difference between an RDD and DataFrame (Spark 2.0.0 DataFrame is a mere type alias for Dataset[Row]) in Apache Spark? Can you convert one to the other?

RDD stands for Resilient Distributed Datasets. It is Read-only partition collection of records. RDD is the fundamental data structure of Spark. It allows a programmer to perform in-memory computations In Dataframe, data organized into named columns. For example a table in a relational database. It is an immutable distributed collection of data.

Pair RDD is just a way of referring to an RDD containing key/value pairs, i.e. tuples of data. It's not really a matter of using one as opposed to using the other. For instance, if you want to calculate something based on an ID, you'd group your input together by ID. This example just splits a line of text and returns a Pair RDD using the first word as the key [1]: val pairs = lines.map(x ...

The APIs RDD It's the first API provided by spark. To put is simply it is a not-ordered sequence of scala/java objects distributed over a cluster. All operations executed on it are jvm methods (passed to map, flatmap, groupBy, ...) that need to be serialized, send to all workers, and be applied to the jvm objects there.