We were working on a Spark job that reads JSON files out of HDFS, and it was running far too slowly. The culprit turned out to be the JSON parsing library. So here are some notes, covering both performance and correctness, to help others navigate the Scala JSON parsing landscape, where there are at least 6 different libraries.
Our use case
We needed to de/serialize:
- Files with nested JSON objects, one per line, with both string and numeric values - so basically a Map[String, Any]. We would like the nested maps to all be scala.Map's.
- A case class with optional fields, some of which are Lists, and ideally should be initialized to
The test data is a 91,582-line file (34.5MB) with one JSON blob per line. The base time for reading this file using something like scala.io.Source.fromFile(logfile).getLines.length is ~130ms.
All benchmarking was conducted on a late model MacBook Pro, using Scala 2.9.3.
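To make the baseline concrete, here is a minimal sketch of how such a measurement could be taken using only the standard library. The names ReadBaseline, timed and countLines are made up for illustration and are not from the original benchmark; a single cold run like this is only a rough number, not a rigorous benchmark.

```scala
import scala.io.Source

// Hypothetical helper, not the original benchmark code.
object ReadBaseline {
  /** Runs `body` once and returns its result with the elapsed wall-clock milliseconds. */
  def timed[A](body: => A): (A, Long) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1000000L)
  }

  /** Counts lines without parsing them: the I/O-only floor for any JSON parser. */
  def countLines(path: String): Int = {
    val source = Source.fromFile(path)
    try source.getLines().length finally source.close()
  }
}
```

Something like `val (lines, ms) = ReadBaseline.timed(ReadBaseline.countLines(logfile))` then gives the number every parser's time should be compared against.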
A thin Scala wrapper around Jackson, JacksMapper has a super simple and easy-to-use interface.
- 80 seconds -- by far the worst time -- for deserializing the 34.5MB file
- Natively handles deserializing to Map[String, Any], including nested maps
- Case class deserialization: missing optional fields, even Lists which default to Nil, are correctly constructed. This library seems to have the best default value initialization around.
Spray-json is based on the parboiled parsing library.
- 8 seconds to deserialize the file - 10x faster than JacksMapper. Woohoo!
- Does not natively unpack to Map[String, Any] -- we needed to supply a new type class to handle this
- Only treats case class fields of Option[_] type as optional - any other fields that are missing from the JSON will cause an exception to be thrown. We did not test this out as our case class did not have
- One benefit is that it has an easy API to generate pretty-printed JSON. Oh, and of course it natively integrates with spray, soon to be akka-http.
NOTE: A major new version of spray-json's backend Parboiled parser has been made available, which should result in order-of-magnitude improvements in parsing times. Unfortunately it's not been incorporated into spray-json yet as of the time of this testing.
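For anyone wondering what "supplying a new type class" for Map[String, Any] might look like, here is a rough sketch. It is not the code we used; AnyMapProtocol, AnyMapReader and unwrap are illustrative names, and it only covers reading. Note that spray-json stores numbers as BigDecimal, so that is what comes out.

```scala
import spray.json._

// Hypothetical reader, assuming spray-json is on the classpath.
object AnyMapProtocol {
  implicit object AnyMapReader extends RootJsonReader[Map[String, Any]] {
    def read(json: JsValue): Map[String, Any] = json match {
      case JsObject(fields) => fields.map { case (k, v) => k -> unwrap(v) }
      case other            => deserializationError("Expected a JSON object, got: " + other)
    }

    // Recursively turns the spray-json AST into plain Scala values.
    private def unwrap(json: JsValue): Any = json match {
      case JsString(s)       => s
      case JsNumber(n)       => n // BigDecimal
      case JsBoolean(b)      => b
      case JsNull            => null
      case JsArray(elements) => elements.map(unwrap).toList
      case JsObject(fields)  => fields.map { case (k, v) => k -> unwrap(v) }
    }
  }
}
```

With `import AnyMapProtocol._` in scope, each line can be parsed with `JsonParser(line).convertTo[Map[String, Any]]`, and nested objects come back as Scala Maps.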
Jerkson is an abandoned project written by Coda Hale when he was still hacking on Scala.
- Incredibly fast - averaged 650ms for deserializing the whole file!
- Deserializes to Map[String, Any] but nested maps are java.util.Maps -- which doesn't meet our original criteria.
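If the speed is tempting enough, the nested java.util.Maps could in principle be converted after the fact. The sketch below is a standard-library-only illustration of that idea, not something we benchmarked; ScalaMaps and deepToScala are made-up names, and the extra pass will cost some of the time saved.

```scala
import scala.collection.JavaConverters._

// Hypothetical post-processing step, independent of any JSON library.
object ScalaMaps {
  /** Recursively replaces java.util.Map and java.util.List values with immutable Scala ones. */
  def deepToScala(value: Any): Any = value match {
    case m: java.util.Map[_, _] =>
      m.asScala.iterator.map { case (k, v) => String.valueOf(k) -> deepToScala(v) }.toMap
    case l: java.util.List[_] =>
      l.asScala.iterator.map(deepToScala).toList
    case m: scala.collection.Map[_, _] =>
      // The top level is already a Scala map, but its values may still be Java collections.
      m.iterator.map { case (k, v) => String.valueOf(k) -> deepToScala(v) }.toMap
    case other => other
  }
}
```

The result has to be cast back, e.g. `ScalaMaps.deepToScala(parsed).asInstanceOf[Map[String, Any]]`, since the function works on untyped values.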
jackson-module-scala is the official Scala support module for Jackson. It's just as fast as Jerkson, and may have inherited Jerkson's work.
- It's also around 650ms
- Serialization doesn't work, at least in the REPL. This is rather disappointing. It throws
- Missing case class fields all get initialized with nulls. This is not bad, I suppose, but I wish proper default values such as Nil were used instead.
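For reference, here is a minimal sketch of wiring up the Jackson Scala module and of papering over null-initialized fields. JacksonScalaExample, Event and tagsOrNil are hypothetical names for illustration, and the null behaviour described above may differ in other versions of the module.

```scala
import com.fasterxml.jackson.databind.ObjectMapper
import com.fasterxml.jackson.module.scala.DefaultScalaModule

// Hypothetical wrapper, assuming jackson-databind and the Scala module are on the classpath.
object JacksonScalaExample {
  val mapper = new ObjectMapper()
  mapper.registerModule(DefaultScalaModule)

  /** Parses one JSON object per call into a Scala Map. */
  def parseLine(line: String): Map[String, Any] =
    mapper.readValue(line, classOf[Map[String, Any]])

  // Illustrative case class; `tags` may come back as null if absent from the JSON.
  case class Event(id: String, tags: List[String])

  /** Normalizes a possibly-null list field to Nil. */
  def tagsOrNil(event: Event): List[String] = Option(event.tags).getOrElse(Nil)
}
```

Wrapping each suspect field in Option(...) is tedious, but it keeps the nulls from leaking into the rest of the code.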
json4s is a very promising project started by the guys from Wordnik (of Swagger fame). It aims to unify Scala JSON ASTs, sports multiple backends (including Jackson), and has native support from both Scalatra and Spray.
- Native deserialization - 940 ms (based on the Lift web framework JSON parser)
- Jackson deserialization - 670 ms
- Can deserialize to Map[String, Any], including nested ones, but using some clumsy workaround, instead of the native
- Missing case class fields throw an exception. :( You can, however, define alternative constructors to get around part of the issue, and writing a custom type class for deserialization is pretty easy.
- Easy pretty printing (
One thing that json4s has that the others don't is an extremely rich functional API for transforming the AST. It can also work with XML, apparently.
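To give a flavour of the json4s API, here is a small sketch using the Jackson backend. It assumes a json4s 3.x release where a field is a (String, JValue) tuple; Json4sExample, toMap, shout and the "name" field are illustrative, and going through `values` is just one possible route to Map[String, Any], not necessarily the workaround we used.

```scala
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Hypothetical helpers, assuming json4s-jackson is on the classpath.
object Json4sExample {
  /** Parses to the AST, then unwraps it into plain Scala values. */
  def toMap(line: String): Map[String, Any] =
    parse(line).values.asInstanceOf[Map[String, Any]]

  /** AST transformation: upper-cases every string field called "name". */
  def shout(json: JValue): JValue = json transformField {
    case ("name", JString(s)) => ("name", JString(s.toUpperCase))
  }

  /** Renders the AST as indented JSON. */
  def prettyPrint(json: JValue): String = pretty(render(json))
}
```

Swapping the import to org.json4s.native.JsonMethods._ should select the native parser instead, which is how the two backends can be compared with the same calling code.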
None of the tested frameworks is perfect. If I had to pick one, I would go with json4s -- it has the most support and features, and with the Jackson backend it is about as fast as jackson-module-scala and Jerkson (670ms versus roughly 650ms).
All of the frameworks offer rich ASTs for transformation of JSON entities before finally converting back into actual Scala objects. In theory you can build an even faster
I know this post will attract lots of comments from folks saying "But what about XXX?" I apologize in advance; we only had time to test a few that we were considering to improve our correctness and performance, but suggestions are welcome.