Bulk Searches with CIFv4

You know those simple features where someone asks you a question and it hits you like a brick? After all these years, how did I miss this?

There are no good answers to this question, other than maybe that, as a network person, you're always "one-off" querying your threat intel repo (albeit, ironically, in bulk). Very rarely are you confronted with a list of IPs you have to weed through. If you are, well, you just write a script that threads through a bunch of GET /api?q=1.1.1.1 requests and you're off to the races. As long as the responses come back in a decent amount of time, everything is fine.

That's not what CIF is about though. CIF is about demonstrating how to do things ever smaller, faster and more efficiently. Its mission is to be the FASTEST way to consume threat intel, and faster is about removing weight, not adding more threads. As an operator, if you have to augment something because of a poor design choice, the system has a bug. It's just sad that it took 4+ versions for me to discover it.

However, as you dive into this "problem" (or feature), it's not as cut and dried as you'd think. When you bulk search for something, your first inclination is to simply run the series of queries on behalf of the user. This effectively solves the problem from their perspective, but it doesn't shed any weight from the system as a whole; it simply moves the weight somewhere else. You're more or less still performing the same number of transactions, which means the implementation might be abstracted, but it's still slow.

The problems we're trying to solve here are:

  1. A simple 'search' function in user-land that "just works" (for 1 query or 100).

  2. Fewer TCP round trips between us and our store.

    1. Client ← HTTP → Server ← ZeroMQ → Store

  3. Less transactional overhead against our store (e.g., a single query instead of 100).

  4. Faster overall response time.

  5. Bigger responses, fewer packets.
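
To make the first two goals concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration, not the CIF API: the function names `build_bulk_query` and `search`, the `indicators` JSON key, and the `send` callable that stands in for whatever HTTP client you use. The only point is that the caller sees one function, and that function makes one request whether it is handed 1 indicator or 100.

```python
import json
from typing import Callable, Iterable, Union


def build_bulk_query(indicators: Union[str, Iterable[str]]) -> str:
    """Normalize one indicator or many into a single JSON request body.

    The {"indicators": [...]} shape is hypothetical, not CIF's wire format.
    """
    if isinstance(indicators, str):
        indicators = [indicators]
    # dict.fromkeys de-duplicates while preserving the caller's order
    unique = list(dict.fromkeys(i.strip() for i in indicators if i.strip()))
    return json.dumps({"indicators": unique})


def search(indicators: Union[str, Iterable[str]],
           send: Callable[[str], str]) -> list:
    """One call, one request, no matter how many indicators are passed.

    `send` is a stand-in for the HTTP layer: it takes the request body
    and returns the response body as a JSON string.
    """
    body = build_bulk_query(indicators)
    return json.loads(send(body))
```

Passing the transport in as `send` keeps the sketch independent of any particular HTTP library, and makes it easy to count how many requests actually leave the client.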

If I'm able to send a single query in a single HTTP session, that's fewer HTTP requests. Fewer HTTP requests means less packet overhead, and less overhead generally leads to faster response times. Faster response times enable my system to respond to more requests, which also means my system requires less hardware (i.e., it's cheaper per request).

Similarly, if I'm able to bulk out my SQL / ES queries into single transactions (for example, with a giant OR / IN query), that's less TCP/SQL overhead. Over time all that transactional overhead adds up in the form of touching network, touching disk, touching indices and having to move all that individual data back over the network to the client.
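
As a rough illustration of the difference, the sketch below uses Python's built-in sqlite3 module with an in-memory database. The `indicators` table, its columns, the sample rows and the function names are all made up for this example and are not CIF's schema; the same idea would apply to any SQL store. The first function issues one query per indicator, the second folds the whole list into a single parameterized IN query.

```python
import sqlite3


def make_demo_store() -> sqlite3.Connection:
    """Build a throwaway in-memory store (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE indicators (indicator TEXT, tags TEXT)")
    conn.executemany(
        "INSERT INTO indicators VALUES (?, ?)",
        [("192.0.2.1", "scanner"),
         ("198.51.100.7", "botnet"),
         ("192.0.2.55", "phishing")],
    )
    return conn


def search_one_by_one(conn, indicators):
    """The naive approach: one transaction against the store per indicator."""
    rows = []
    for indicator in indicators:
        cur = conn.execute(
            "SELECT indicator, tags FROM indicators WHERE indicator = ?",
            (indicator,),
        )
        rows.extend(cur.fetchall())
    return rows


def search_bulk(conn, indicators):
    """The bulk approach: a single IN query for the whole list."""
    indicators = list(indicators)
    if not indicators:
        return []  # "IN ()" is not valid SQL, so short-circuit
    # One "?" placeholder per value keeps the query parameterized
    placeholders = ",".join(["?"] * len(indicators))
    sql = (
        "SELECT indicator, tags FROM indicators "
        f"WHERE indicator IN ({placeholders})"
    )
    return conn.execute(sql, indicators).fetchall()
```

Both functions return the same rows (possibly in a different order); the bulk version just gets there with one statement instead of N. Stores cap the number of bound parameters per statement, so a very large list would still need to be split into a few chunks rather than one truly giant query.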

Reducing Weight

If your goal is to provide threat intelligence to thousands of customers over time, those little bytes add up quickly. At first you won't even notice that there's a problem, usually because, with next to zero users, your system is overbuilt. As you add users you might become aware that there are slow spots in the system. Those issues are usually easily solved by upgrading your instances, which you were already prepared to do.

These seemingly small efficiency issues really don't become obvious until you've started scaling up your operations and discover that all those little transactions are crushing the overall system. It's at this point you realize that, even if you want to actually solve the underlying problem, you might have to re-architect the system from scratch. You now have a production service that customers rely on, and making these fundamental changes to the underlying architecture can have drastic consequences.

Of course, you could spend years over-thinking all the little improvements you could make to achieve the performance of "cat | sort | awk | grep". Years you could have spent getting your product in front of customers to gain valuable insights and feedback. This sort of lesson falls somewhere in between. Things you should think about when prototyping. Things you can learn from others so you know what you should be focusing on next.



