• Quick note - the problem with YouTube videos not embedding on the forum appears to have been fixed, thanks to ZiprHead. If you do still see problems, let me know.

Internet chaos as Cloudflare goes down.

Hey - it's not that easy - you need to get hold of an old-style mini-spotlight bulb; an LED one doesn't produce enough heat.
I have been asked in the past to make a replacement 'bulb' for lava lamps (a mate actually collects them, lol - there's no accounting for some people's tastes...), and it really isn't that hard to make them up for them, or indeed for any application that needs a 'hot' lamp - e.g. some older egg incubators, reptile cages etc. all used bulbs as a heat source in the past.

You can buy 'bulb bases' online readily, and if it's just a heat source that's needed, an appropriately rated resistor does the job fine. For those cases where both heat and light are needed, a resistor coupled with an LED does it. It's a couple-of-minutes job to make a 'replacement' bulb/heat source up, and the parts are readily available online - hell, you can even buy the hand tool to do it (although you can do it without one), or even full-on production-line machines if you want to 'make your own' as a production-line job.
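If you want to size the resistor, it's just Ohm's law. Here's a rough sketch - the 12 V supply and 25 W heat target are purely my own example numbers, and whatever you pick should be rated well above the power it actually dissipates:

```rust
// Rough sketch: size a resistor used purely as a heat source.
// The 12 V supply and 25 W target are example numbers only.
fn resistor_for_heat(supply_volts: f64, target_watts: f64) -> (f64, f64) {
    // P = V^2 / R  =>  R = V^2 / P
    let resistance_ohms = supply_volts.powi(2) / target_watts;
    // Pick a part with generous headroom (say 2x) so it isn't run at its limit.
    let min_power_rating_watts = target_watts * 2.0;
    (resistance_ohms, min_power_rating_watts)
}

fn main() {
    let (ohms, rating) = resistor_for_heat(12.0, 25.0);
    println!("roughly {ohms:.1} ohm, rated for at least {rating:.0} W");
    // prints: roughly 5.8 ohm, rated for at least 50 W
}
```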
 
I generally like cloudflare. They're easy to use, and they're a perfect upstream for my pi-hole. This is a pretty big blunder though.
 
Yep. They do apologise, but from the start they describe the change to the database as if it were a natural event that just happens.
This has more detail.


The change to the database is actually made often, as new threats are detected and analysed; it's more of a continual process. For reasons of reliability and performance, the memory for the rules is allocated only once, before each monitoring task starts. That means only a certain number of rules can fit in that fixed-size allocation.
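Roughly the shape of that design, as a sketch only - the names and the 200-rule limit are placeholders I've made up, not anything from their actual code:

```rust
// Illustrative sketch of a preallocated, fixed-size rule table.
const MAX_RULES: usize = 200; // made-up limit for the example

struct RuleTable {
    rules: Vec<String>, // one entry per rule, allocated once up front
}

impl RuleTable {
    fn new() -> Self {
        // Allocate the full capacity once so the hot path never reallocates.
        Self { rules: Vec::with_capacity(MAX_RULES) }
    }

    // Anything bigger than the table was sized for has to be dealt with here;
    // this is exactly where "fail hard" versus "degrade gracefully" gets decided.
    fn load(&mut self, incoming: Vec<String>) -> Result<(), String> {
        if incoming.len() > MAX_RULES {
            return Err(format!(
                "got {} rules but the table only holds {}",
                incoming.len(), MAX_RULES
            ));
        }
        self.rules.clear();
        self.rules.extend(incoming); // stays within the original allocation
        Ok(())
    }
}

fn main() {
    let mut table = RuleTable::new();
    let update: Vec<String> = (0..250).map(|i| format!("rule-{i}")).collect();
    match table.load(update) {
        Ok(()) => println!("rules loaded, table holds {}", table.rules.len()),
        Err(e) => eprintln!("update rejected: {e}"),
    }
}
```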

Their code to detect and react to a larger-than-permitted number of rules is not well written. It just fails hard without providing helpful diagnostics. The logic for dealing elegantly with a larger and more complex set of rules than permitted was never implemented.
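To make that concrete - a sketch only, not their actual code - the difference in Rust-like terms is roughly between unwrapping a result, where one bad read kills the process with a bare panic, and matching on it so the failure at least says what happened:

```rust
use std::fs;

// Sketch only: two ways of reacting when loading a rule/config file fails.
fn main() {
    let path = "rules.conf"; // hypothetical file name

    // Hard-fail style: one bad read and the whole process panics,
    // with no context beyond the raw error.
    // let contents = fs::read_to_string(path).unwrap();

    // Same failure, but with diagnostics an operator can actually use:
    match fs::read_to_string(path) {
        Ok(contents) => println!("loaded {} bytes of rules", contents.len()),
        Err(e) => eprintln!("could not load rules from {path}: {e}"),
    }
}
```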
 
Questions such as, "Does the database query return the kind of result the programmer expected?" seem like something that should have been tested in a development environment and could easily be tested in a staging environment - not the kind of thing that needs to be thrown into production with fingers crossed. This is something my software team would likely have caught via review: all code in our critical applications must be approved by two senior software engineers before it can be accepted into the version control system. I know hindsight is 20/20, but programming constructs that on their face can produce an unrecoverable error are the kinds of red flags our team notices.
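For example, a check in that spirit (the names here are entirely hypothetical, standing in for whatever runs the real query) might pin down what a sane query result is allowed to look like:

```rust
// Hypothetical sketch of the kind of check that belongs in CI or staging.
// `fetch_rule_features` stands in for whatever runs the real database query.
fn fetch_rule_features() -> Vec<(String, String)> {
    // In a real test this would hit a staging database or a fixture.
    vec![("feature_a".to_string(), "v1".to_string())]
}

#[cfg(test)]
mod tests {
    use super::*;
    use std::collections::HashSet;

    const MAX_FEATURES: usize = 200; // placeholder limit

    #[test]
    fn rule_query_stays_within_limits_and_has_no_duplicates() {
        let rows = fetch_rule_features();

        // The query should never hand back more than we allocated for...
        assert!(rows.len() <= MAX_FEATURES, "query returned {} rows", rows.len());

        // ...and each feature should appear exactly once.
        let names: HashSet<_> = rows.iter().map(|(name, _)| name).collect();
        assert_eq!(names.len(), rows.len(), "duplicate feature names in query result");
    }
}
```

Run with cargo test; in a real pipeline the stub would point at a staging database rather than fixture data.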

Programming that responds to errors in input data (which would include there being too much of that data) by aborting the program doesn't seem well thought out for a critical, ongoing process. I can immediately think of several techniques to mitigate this, but it boils down to simply not allowing an unhandled exceptional condition into production code. I agree with the presenter in the video: it largely doesn't matter what language you use or how it models program exceptions.
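One of those techniques, sketched very roughly (my own example, not a description of their actual system): treat every periodic refresh as optional, and keep serving with the last known-good rule set whenever an update looks wrong.

```rust
// Very rough sketch of "keep the last known-good config" (example only).
struct Proxy {
    rules: Vec<String>, // whatever the service is currently running with
}

impl Proxy {
    // Called on every periodic refresh. A bad update is logged and dropped;
    // the process keeps running on the previous rule set instead of aborting.
    fn try_refresh(&mut self, update: Result<Vec<String>, String>) {
        match update {
            Ok(new_rules) if !new_rules.is_empty() => self.rules = new_rules,
            Ok(_) => eprintln!("empty rule update ignored, keeping {} rules", self.rules.len()),
            Err(e) => eprintln!("bad rule update ignored ({e}), keeping {} rules", self.rules.len()),
        }
    }
}

fn main() {
    let mut proxy = Proxy { rules: vec!["allow-all".to_string()] };
    proxy.try_refresh(Err("rule set too large".to_string()));
    proxy.try_refresh(Ok(vec!["block-bad-bot".to_string()]));
    println!("running with {} rule(s)", proxy.rules.len());
}
```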
 
The hardest part for Cloudflare is that they let the cat out of the bag, too. When flaws like these get caught, it generally signals to hackers what type of flaws the company is prone to and how its processes handle data. Cloudflare had better be in panic mode right now.

They always seemed like they had it together. The sheer time involved in managing traffic to 1.1.1.1 when that project took off had to be crazy.
 
Not every aspect of a global-scale production system can be tested outside of that system.
These are billion-dollar companies. They can build environments big enough to test all of their distributable components.

Also, decent testing should cover edge conditions such as query-response overloads. "What happens if this query returns the whole database? Is that possible, and how? Is that actually a bad situation? How do we handle it? How might we prevent it?" Etc.
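As a concrete (and again hypothetical) example of that kind of edge-condition test: hand the loader a result that simulates "the whole database came back" and assert that it refuses it cleanly instead of blowing up.

```rust
// Hypothetical edge-condition test: "what if the query returns everything?"
fn accept_rule_rows(rows: Vec<String>, max: usize) -> Result<Vec<String>, String> {
    if rows.len() > max {
        Err(format!("{} rows exceeds the limit of {}", rows.len(), max))
    } else {
        Ok(rows)
    }
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn oversized_query_result_is_rejected_not_fatal() {
        // Simulate a query that accidentally returned the whole database.
        let everything: Vec<String> = (0..1_000_000).map(|i| format!("row-{i}")).collect();

        // The loader should return an error we can act on, not panic.
        assert!(accept_rule_rows(everything, 200).is_err());
    }
}
```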
 
One thing Cloudflare could do is apply any change to a single customer organisation first and see what happens. If it works, do a few more, then repeat until it is fully rolled out. Worst case, only a few organisations go down for a few minutes until the change is backed out.
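A crude sketch of that idea (the cohort sizes and the health check are placeholders for real monitoring, not anything Cloudflare actually runs):

```rust
// Crude sketch of a staged rollout with automatic back-out (placeholder logic).
fn apply_change(org: &str) { println!("change applied for {org}"); }
fn roll_back(org: &str) { println!("change rolled back for {org}"); }
fn looks_healthy(_org: &str) -> bool { true } // stand-in for real monitoring

fn staged_rollout(orgs: &[String]) {
    let mut done: Vec<String> = Vec::new();
    let mut batch = 1; // grow the blast radius gradually: 1, then 10, then 100...
    let mut idx = 0;

    while idx < orgs.len() {
        let end = (idx + batch).min(orgs.len());
        for org in &orgs[idx..end] {
            apply_change(org);
            done.push(org.clone());
        }
        // If anything looks wrong, back out everything touched so far and stop.
        if !orgs[idx..end].iter().all(|org| looks_healthy(org)) {
            for org in &done {
                roll_back(org);
            }
            return;
        }
        idx = end;
        batch *= 10;
    }
}

fn main() {
    let orgs: Vec<String> = (0..25).map(|i| format!("org-{i}")).collect();
    staged_rollout(&orgs);
}
```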
 