
Thanks for writing this! I'm thinking about using Spark for a little 2M-data-point project that I'm working on, just for the learning experience.

Out of curiosity, what kinks did you find?



Essentially the usual Spark caveats of lazy evaluation and immutability of caches: neither is a big deal on small datasets, but making a mistake with either on a large dataset can cost you a lot of time or cause real confusion.

Then there are the massive shuffle reads/writes, which can amount to 50GB of i/o, and that's not great for SSDs.



