Essentially the same Spark caveats apply: lazy evaluation and the immutability of cached data. Neither is a big deal on small datasets, but a mistake on either with a large dataset can cost a lot of time or cause real confusion.
Then there are the massive shuffle reads and writes that result in 50GB of I/O, which is not great for SSDs.
Out of curiosity, what kinks did you find?