This month at DerbyPy I gave a high-level introduction to PySpark. I went over the Spark execution model, talked about the difference between the PySpark DataFrame and RDD APIs, and provided some examples of how to use both. As part of this I put together a Jupyter notebook and some scripts that can be run via spark-submit, along with instructions on how to run them.
If you're interested, the material and presentation can be found here.