They’re shifty fellows, alright! And that’s saying something coming from a lizard with a mood ring for an epidermis. The second you think you got these phishy fraudsters pegged, poof! Off to some shady corner of the web. Keeping up with them? Let’s just say it’s trickier than blending in at a kaleidoscope convention (ask me how I know). What we need is a clever way to pin these slippery suckers down, a fingerprinting of sorts that they can’t so easily shake off like yesterday’s skin. We need a good IPv4 web scanner!
Your web scanner shouldn’t be the average bug snatcher. Think the entire IPv4 space, scanned daily, with 90+ web characteristics. I don’t care how good your camouflage abilities are, every chameleon retains a few distinct patterns when changing colors. Can you imagine how confusing it would be to remember who’s who if they didn’t? Websites are no exception. With a good scanner, we can track patterns like HTML titles, ssdeep (aka fuzzy) hashes of the HTML body, favicons, SSL certificates, and more. “But Kodama,” you say, “what good is all that if you don’t know what to look for?” I’m glad you asked.
Let the machines do the heavy lifting—supervised, of course. With supervised machine learning, we teach the model to snag those fraudsters faster than flicking out my tongue. Let me break it down into byte-sized chunks (OK, maybe not that small).
We need an initial batch of starter data to which we can apply labels. This dataset can be small and manageable, nothing to lose your tail over! Perhaps use your organization’s public internet assets, and we’ll label them with simple categories like “legitimate,” “malicious,” and “benign.” No need for anything overly complicated.
Once we’ve got our labeled data, it’s time to select which web features we’ll be using for the hunt. The features you choose set the balance of true positives to false positives, and how much data your system has to chew. The secret sauce is in fine-tuning your system over time, catching as many true positives as possible without overwhelming your analysts. A good starting lineup includes HTML titles, ssdeep hashes of the HTML body, favicon hashes, and SSL certificates.
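To make that lineup concrete, here is a minimal sketch of turning one scanned page into a feature record. Everything named here (`TitleParser`, `extract_features`, the dictionary keys, and the choice of SHA-256) is an assumption for illustration; a real scanner may hash favicons and certificates differently. The ssdeep fuzzy hash is left out because it needs a third-party library.

```python
import hashlib
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_features(html, favicon_bytes, cert_der):
    """Build one feature record from a page's HTML, favicon bytes, and certificate bytes.

    Assumption: SHA-256 stands in for whatever hashes your scanner reports.
    """
    parser = TitleParser()
    parser.feed(html)
    return {
        # Collapse runs of whitespace so near-identical titles compare equal.
        "html_title": " ".join(parser.title.split()),
        "favicon_sha256": hashlib.sha256(favicon_bytes).hexdigest(),
        "cert_sha256": hashlib.sha256(cert_der).hexdigest(),
    }
```

Each record can then be paired with its label, so every feature value ends up with a tally of malicious and non-malicious sightings.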
Next, we score the individual values for each chosen feature. Stick with me; the math’s nothing too crazy:
$$\text{Score} = \frac{2\,\text{TP} \times 100}{2\,\text{TP} + \text{FP} + (k \times \text{Total})}$$
Here TP is the number of true positives, FP is the number of false positives, Total is the number of URLs sharing a particular feature value, and k is a smoothing factor. You can set k between 0 and 1; let’s say 0.2. The ×100 is purely optional: score thresholds are a lot cleaner to set as integers between 0 and 100 than as floats between 0 and 1. The idea is to avoid punishing features that don’t have much data, while still evaluating their performance.
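Here is the scoring equation as a minimal Python sketch. The function names and the record layout are hypothetical, and the tallying step rests on two assumptions of mine: a “malicious” label counts as a true positive and any other label as a false positive, and Total counts only the labeled URLs sharing that value.

```python
from collections import Counter


def feature_score(tp, fp, total, k=0.2, scale=100):
    """Score one feature value: (2*TP*scale) / (2*TP + FP + k*Total)."""
    denominator = 2 * tp + fp + k * total
    if denominator == 0:
        return 0.0  # no observations at all for this value
    return (2 * tp * scale) / denominator


def score_values(records, k=0.2):
    """Score every value of one feature.

    `records` is an iterable of (feature_value, label) pairs.
    Assumption: "malicious" is a true positive, anything else a false positive.
    """
    tp, fp, total = Counter(), Counter(), Counter()
    for value, label in records:
        total[value] += 1
        if label == "malicious":
            tp[value] += 1
        else:
            fp[value] += 1
    return {value: feature_score(tp[value], fp[value], total[value], k) for value in total}
```

For example, an HTML title seen on 10 labeled URLs, 8 malicious and 2 not, scores 1600 / (16 + 2 + 2) = 80. One thing to keep in mind when picking thresholds: under these assumptions a flawless value (no false positives) with k = 0.2 tops out at 200 / 2.2, or roughly 90.9, rather than 100.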
Worried that you don’t have enough malicious URLs in your dataset? Don’t sweat it! This approach isn’t just for finding fraudsters; did I mention it’s perfect for attack surface management (ASM) too? Monitor your own assets, and if someone starts impersonating your organization, you’ll catch them soon enough.
Once you’ve got your scores, you can use them to create new queries. If the data flow is slow, lower your thresholds. Too much noise? Raise them. If one feature’s noisier than the rest, just tweak its threshold without disrupting the whole system. By the end of this process, you’ll have a list of high-value queries ready to run in your scanner, pulling down more juicy data for you to label.
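The threshold tuning can be sketched like this. The `field:"value"` query syntax, the function name, the field names, and the default cutoff of 70 are all placeholders, so swap in whatever your scanner’s search language actually expects.

```python
DEFAULT_THRESHOLD = 70  # hypothetical starting cutoff on the 0-100 scale


def select_queries(scores, thresholds=None, default=DEFAULT_THRESHOLD):
    """Turn scored feature values into search queries, best score first.

    `scores` maps feature name -> {feature value: score}.
    `thresholds` optionally overrides the cutoff for individual features,
    so one noisy feature can be tightened without touching the rest.
    """
    thresholds = thresholds or {}
    selected = []
    for feature, value_scores in scores.items():
        cutoff = thresholds.get(feature, default)
        for value, score in value_scores.items():
            if score >= cutoff:
                # Placeholder query syntax; values are not escaped in this sketch.
                selected.append((score, f'{feature}:"{value}"'))
    # Highest score first; ties broken alphabetically for stable output.
    selected.sort(key=lambda item: (-item[0], item[1]))
    return [query for _, query in selected]


scores = {
    "html_title": {"Acme Login": 84.2, "Welcome": 35.0},
    "favicon_sha256": {"ab12": 72.0},
}
all_queries = select_queries(scores)
strict_favicons = select_queries(scores, thresholds={"favicon_sha256": 80})
```

With the default cutoff, `all_queries` keeps the “Acme Login” title and the favicon hash; raising only the favicon threshold to 80 drops the favicon query and leaves the title query alone.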
And guess what? You’ve now looped back to step one! Label, refine, repeat. Feeling fancy? You can even automate the labeling. But remember—data processing is a whole other step in the threat intelligence lifecycle, and that’s a tale for another blog.
So, to summarize our fraudster-finding recipe: start with a generous helping of web scanning, sprinkle in some labels, stir up those features according to a good scoring equation, and pop it all into a search query to start the cycle again. Stretching the cooking analogy a bit too much? Yeah, probably… Ahem! Most definitely. But maybe it’s just my inner chef trying to escape. Who knew a chameleon could be such a foodie? OK, I’ll stop before I start mixing metaphors like a blender on overdrive.