Talk on Big Data Tools and Architectures at Foo Cafe

I went to a really interesting talk at Foo Cafe in Stockholm tonight. I stayed almost two hours past my time limit, but it was so worth it. The talk was by Martin Neumann, and he gave us an introduction to big data and some of the tools commonly used to wrangle it. Below are the notes I took throughout the talk. I cleaned them up a tiny bit, but mostly these are just for my personal benefit and are super messy - if you didn’t attend the talk they may not be that useful.

Martin gave a couple of really great real-world examples of companies he’s worked with on various big data projects. Though he had permission to use the examples in the talk, I’m not sure whether he’d mind my posting specific company names here, so I’ve decided to refer to the companies in very general category terms. If I get the OK to replace these with actual names later, I’ll do that.

Random Stuff


The “Five Vs”.

1. Volume

Volume challenges: storage; redundancy/backups; discoverability/filtering; processing/analysis

2. Variety

Traditional data: databases, structured fields, rules, etc. Big data: more of a dump, unstructured - videos, text, random URLs, etc.

Variety challenges: data is often quite noisy and needs cleaning; there is no context in the data; disambiguation - is it about Paris the city or Paris the person?

Content Streaming Company X example: they get meta info about content from the labels; people manually add labels OR they have a db connection. Which “Kent” is this new piece of content from? There are 7-8 creators with the name Kent, and there is no global identifier for creators.

3. Veracity

Challenges: data quality is quite low; errors, missing data, unexpected structure, etc.

Content Streaming Company X eg: metadata from labels may be just wrong; some creators put in misleading data intentionally to broaden viewership.

4. Value

Data and knowledge are not the same thing. What are these 5 million tweets about?

Challenges: data is too big to just look at/visualize. Traditional analytics tools often do not work at large scales.

5. Velocity

Data comes in at a very fast rate. The NY Stock Exchange produces ~1 TB of trade info per day. That's not that big, but you need to process it in milliseconds. Storage on a single machine would be way too slow.
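To put the "not that big" in perspective, here is a quick back-of-envelope calculation (mine, not from the talk). It assumes a decimal terabyte and traffic spread evenly over 24 hours, which real trading certainly isn't, so treat the result as a floor rather than a peak.

```python
# Back-of-envelope: average throughput implied by ~1 TB of data per day.
BYTES_PER_TB = 10**12            # decimal terabyte (assumption)
SECONDS_PER_DAY = 24 * 60 * 60   # assumes traffic is spread over the full day


def average_rate_mb_per_s(tb_per_day: float) -> float:
    """Average megabytes per second for a given daily volume."""
    return tb_per_day * BYTES_PER_TB / SECONDS_PER_DAY / 10**6


rate = average_rate_mb_per_s(1.0)
print(f"{rate:.1f} MB/s on average")  # prints "11.6 MB/s on average"
```

The average rate is modest, so the hard part is the millisecond latency requirement rather than the volume.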


Multiple machines because:

With distributed and big data systems, failure is the norm. Something is failing all the time.


CAP theorem: you can only have two of the three: consistency, availability, partition tolerance.

There are degrees of each, but in general you pick 2 of the above.
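A toy illustration of the trade-off (my own sketch, not from the talk): a value stored on two replicas, with a hypothetical `mode` switch. When the network between the replicas is partitioned, the "CP" store refuses writes to stay consistent, while the "AP" store keeps accepting writes and lets the replicas diverge. `TinyStore` and `Replica` are made-up names for illustration only.

```python
class Replica:
    def __init__(self):
        self.value = None


class TinyStore:
    """Two replicas of one value; `mode` is 'CP' or 'AP' (hypothetical toy)."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True when A and B cannot talk to each other

    def write(self, value):
        """Write via replica A. Returns True if the write was accepted."""
        if self.partitioned:
            if self.mode == "CP":
                return False      # refuse: stay consistent, give up availability
            self.a.value = value  # accept: stay available, replicas diverge
            return True
        self.a.value = self.b.value = value
        return True

    def consistent(self):
        return self.a.value == self.b.value


cp, ap = TinyStore("CP"), TinyStore("AP")
for store in (cp, ap):
    store.write("v1")
    store.partitioned = True

print(cp.write("v2"), cp.consistent())  # False True
print(ap.write("v2"), ap.consistent())  # True False
```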

Bank transactions/systems are not usually done in big data stacks; there is usually a single server that runs that stuff somewhere.


Data Analysis Framework

Big Data systems based on above:

Data source:



Hadoop HDFS

To Write:

To Read: Ask the coordinator for the file, and it reconstructs the file from the storage nodes. Chunks can be retrieved in parallel.

Speed depends more on the network than on any one machine's hard drive. Reads are fast. This system is the industry standard in big data storage. It is usually installed on cloud servers/VMs. Storage nodes need to communicate with each other.
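Here is a toy model of the read path described above (my own sketch, not the HDFS API). `Coordinator`, `StorageNode` and the tiny `CHUNK_SIZE` are hypothetical names and values. My notes on the write path are missing, so the round-robin placement without replication in `write` is purely an assumption to give the read something to work with.

```python
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 4  # bytes; tiny on purpose, for illustration only


class StorageNode:
    def __init__(self):
        self.chunks = {}  # (filename, chunk index) -> bytes


class Coordinator:
    """Remembers which node holds each chunk of each file."""

    def __init__(self, nodes):
        self.nodes = nodes
        self.locations = {}  # filename -> list of node ids, one per chunk

    def write(self, name, data):
        # Assumed placement: round-robin over nodes, no replication.
        self.locations[name] = []
        for start in range(0, len(data), CHUNK_SIZE):
            idx = start // CHUNK_SIZE
            node_id = idx % len(self.nodes)
            self.nodes[node_id].chunks[(name, idx)] = data[start:start + CHUNK_SIZE]
            self.locations[name].append(node_id)

    def read(self, name):
        # Fetch all chunks in parallel, then join them in their original order.
        def fetch(pair):
            idx, node_id = pair
            return self.nodes[node_id].chunks[(name, idx)]

        with ThreadPoolExecutor() as pool:
            parts = pool.map(fetch, enumerate(self.locations[name]))
            return b"".join(parts)


coordinator = Coordinator([StorageNode() for _ in range(3)])
coordinator.write("notes.txt", b"hello big data")
print(coordinator.read("notes.txt"))  # b'hello big data'
```

`pool.map` returns results in input order, so the chunks can be fetched concurrently and still be joined correctly.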

Kafka message broker; low-latency storage

Used for streaming data

Write op:

Read op:

Unlike batch storage, where data is there “forever”, this is a message broker, and a message broker has a limited lifespan for its messages/events. If you need data long term you can write it from the message broker to HDFS.
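A small sketch of the retention idea (mine, and deliberately not the Kafka API): an in-memory log that drops messages older than a retention window, plus an `archive` helper that copies messages to a long-term store before they expire. `ToyBroker`, `archive`, and the plain list standing in for HDFS are all hypothetical.

```python
class ToyBroker:
    """Append-only log with time-based retention (a toy, not the Kafka API)."""

    def __init__(self, retention_seconds):
        self.retention_seconds = retention_seconds
        self.log = []         # list of (offset, timestamp, message)
        self.next_offset = 0

    def publish(self, message, now):
        self.log.append((self.next_offset, now, message))
        self.next_offset += 1

    def expire(self, now):
        """Drop messages older than the retention window."""
        cutoff = now - self.retention_seconds
        self.log = [entry for entry in self.log if entry[1] >= cutoff]

    def read_from(self, offset):
        return [msg for off, _, msg in self.log if off >= offset]


def archive(broker, store, start_offset):
    """Copy messages at or after start_offset into store; return the next offset."""
    for off, _, msg in broker.log:
        if off >= start_offset:
            store.append(msg)
            start_offset = off + 1
    return start_offset


broker = ToyBroker(retention_seconds=10)
broker.publish("a", now=0)
broker.publish("b", now=5)

long_term = []                       # stand-in for long-term storage
next_offset = archive(broker, long_term, 0)

broker.expire(now=12)                # "a" is now older than 10 seconds
print(broker.read_from(0))           # ['b']
print(long_term)                     # ['a', 'b']
```

Tracking the returned offset means a later `archive` call only copies messages it has not seen yet.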



Works by running a functional program against a data stream.
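To give a feel for what "a functional program against a data stream" can look like, here is a minimal sketch using plain Python generators (my own example, not a specific streaming framework). The `events` source and its fields are made up. The point is that the pipeline is a chain of `map`/`filter` steps applied lazily to an endless stream.

```python
from itertools import islice


def events():
    """Hypothetical endless stream of (sensor_id, reading) events."""
    n = 0
    while True:
        yield (f"sensor-{n % 3}", n * 2)
        n += 1


def pipeline(stream):
    # Each step is lazy: nothing runs until a consumer pulls an event.
    parsed = map(lambda e: {"id": e[0], "value": e[1]}, stream)
    large = filter(lambda e: e["value"] > 3, parsed)
    return map(lambda e: (e["id"], e["value"]), large)


# The stream never ends, so we only take the first three results.
first_three = list(islice(pipeline(events()), 3))
print(first_three)  # [('sensor-2', 4), ('sensor-0', 6), ('sensor-1', 8)]
```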



Kibana + ElasticSearch:





Hops (Hadoop for humans)



MS Azure

Your own hardware



Example 1: Big Tech Company Y Research (not in production, partial success)

You don’t want to find one random error message, but the sweet spot of likely bad cases that you can still do something about before everything explodes. The system took error messages and predicted whether things were still OK or whether something was so abnormal that action needed to be taken.

Challenge: there was not enough time to build it, and the algorithm was complicated. The algorithm IS in use in other products, e.g. anomaly detection for ship movements.
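The talk didn't go into the actual algorithm, which was described as complicated. As a much simpler stand-in (my own, and not what the project used), here is a sliding-window error-rate check. It ignores single random errors and only flags when more than `max_errors` arrive within `window_seconds`. Both parameters and the class name are hypothetical.

```python
from collections import deque


class ErrorRateMonitor:
    """Flags when errors within a sliding time window exceed a threshold."""

    def __init__(self, window_seconds, max_errors):
        self.window_seconds = window_seconds
        self.max_errors = max_errors
        self.timestamps = deque()

    def record(self, timestamp):
        """Record one error; return True if action should be taken."""
        self.timestamps.append(timestamp)
        # Drop errors that have fallen out of the window.
        while self.timestamps[0] <= timestamp - self.window_seconds:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_errors


monitor = ErrorRateMonitor(window_seconds=60, max_errors=3)
alerts = [monitor.record(t) for t in (0, 100, 200, 201, 202, 203)]
print(alerts)  # [False, False, False, False, False, True]
```

Isolated errors at 0 and 100 seconds never trigger anything; only the burst of four errors around 200 seconds does.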

Example 2: Car industry, research prototype, not deployed yet

Analyze hazard warnings: is it someone parked in the wrong lane or a traffic jam? Which warning lights are more relevant to others?

Challenge: there was no big data in the system, only 100 events or so. Large amounts of events had to be simulated to show what the infrastructure could handle.

Example 3: Big Tech Company Y production project


Big Tech Company Y switched from OpenStack to Kubernetes because:

Container solutions like Kubernetes will likely replace VM solutions like OpenStack. Kubernetes is much smaller and faster: spinning up a VM takes a long time (~15 mins), while Kubernetes can spin up a new container in ~2 seconds. If something fails you can spin something back up to replace it before anyone notices.
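The "replace it before anyone notices" idea can be sketched as a loop that compares what is running with what should be running and starts replacements (my own toy example, not the Kubernetes API). `start_container` is a hypothetical stand-in for asking the platform to start a container.

```python
import itertools

_ids = itertools.count(1)


def start_container():
    """Hypothetical stand-in for asking the platform to start a container."""
    return {"id": next(_ids), "healthy": True}


def reconcile(running, desired_count):
    """Keep healthy containers and start replacements up to desired_count."""
    healthy = [c for c in running if c["healthy"]]
    while len(healthy) < desired_count:
        healthy.append(start_container())
    return healthy


containers = reconcile([], desired_count=3)
containers[1]["healthy"] = False           # simulate a failure
containers = reconcile(containers, desired_count=3)
print([c["id"] for c in containers])       # [1, 3, 4]
```

The faster a replacement starts, the shorter the time the system runs below its desired count, which is why the ~2 second container start-up matters so much.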



What are some challenges you are having?



© - 2021 · Liza Shulyayeva ·