How to protect your website on Cloudflare (plus logs)


How to protect your website that is on Cloudflare?

(Last update: May 6, 2024)


Don’t expect that everything will be fine just because you set up the CF nameservers.
Cloudflare is a tool that you need to understand, learn, and, after proper configuration, put to work for you.


We often see here on the forum that websites have been suspended for exceeding their limits, and this is mostly caused by a bunch of bots running vulnerability scans and probing for sensitive directories and files on the site.

Some of these bots are very reckless and make over 30 requests per second!

An example of what it looked like on my website:






Analytics from CF (nearly a million requests, generating 20 GB of traffic)





Fortunately, a large part of my website is static content, so everything was served by CF (20 GB).

Imagine what it’s like for someone running WordPress, where the server has to generate every page the bot requests.

This kind of behavior drains system resources, hogs bandwidth, and prevents your site from operating at maximum capacity.




After my intervention, the situation looks like this




To protect yourself


You first need to see the logs, which is not possible *1 if you use the free plan on Cloudflare.


Currently, to install Logflare on Cloudflare, use these instructions
and ignore the others: How to do logging in Logflare and its installation on Cloudflare

(So it is good to install this application from Cloudflare Apps.
Official site and help: https://logflare.app/
You can find more information about it here: Logflare CloudFlare App - Review | Nick Samuel)


*1 Thanks to @AnimalTheGamer we also have an alternative to Logflare directly on CF - look here



Once you set up the app

You can access the logs on their website or, like me, generate a public link that you can bookmark, so you don’t have to log into their system.
Instead, you just watch the logs directly in real time.



Scroll until you see this section (and generate a URL)



Then look in the logs for unwanted guests,

and based on what these bots request on your site, create firewall rules on your CF.


In this example, you can see that the bot searches non-stop for the file wlwmanifest.xml in different locations,

and bots can often try over 500 locations in a matter of seconds, which of course burdens your origin.



So instead of dealing with IP blocking nonstop

(this doesn’t mean you shouldn’t do it),
you can shorten the job by creating FW rules.

Hint: use OR in FW rules to save some of your free rules.




The example I use applies only to my website,
because I don’t have WordPress and can therefore “easily” decline these requests.


If anyone wants these rules (note: I don’t use WP),

you can click Edit expression on a CF FW rule

and paste my code inside.



Rule 1 (user agents): bad bot engines

(http.user_agent contains "python-requests") or (http.user_agent contains "Python-urllib") or (http.user_agent contains "Apache-HttpClient") or (http.user_agent contains "Nuclei") or (http.user_agent contains "MJ12bot") or (http.user_agent contains "httpx") or (http.user_agent contains "MauiBot")
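
A side note on these user-agent matches: as far as I know, the contains operator in Cloudflare’s expression language is case sensitive, so a bot that sends "MJ12Bot" instead of "MJ12bot" would slip past the rule above. A hedged sketch of a case-insensitive variant using the lower() transformation (verify that the function is available on your plan before relying on it):

(lower(http.user_agent) contains "python-requests") or (lower(http.user_agent) contains "python-urllib") or (lower(http.user_agent) contains "apache-httpclient") or (lower(http.user_agent) contains "nuclei") or (lower(http.user_agent) contains "mj12bot") or (lower(http.user_agent) contains "httpx") or (lower(http.user_agent) contains "mauibot")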


Rule 2 (files and paths): vulnerability scans

(http.request.uri.path contains "wlwmanifest.xml") or (http.request.uri.path contains "wp-includes") or (http.request.uri.path contains "xmlrpc.php") or (http.request.uri.path contains ".env") or (http.request.uri.path contains "wp-content") or (http.request.uri.path contains "eval-stdin.php") or (http.request.uri.path contains "wp-login") or (http.request.uri.path contains "wp-admin")


The code above (the WordPress directories part) can also be shortened quite a bit by using this:

(http.request.uri.path contains "/wp-")

but I intentionally did not use it, because someone may have a file or directory whose name starts like that,
so it would be problematic.




It is also advisable to do this, so that no one circumvents the FW rules:

URL normalization modifies separators, encoded elements, and literal bytes in incoming URLs
so that they conform to a consistent formatting standard.
For example, consider a firewall rule that blocks requests whose URLs match example.org/hello.
The rule would not block a request containing an encoded element — example.org/%68ello.
Normalizing incoming URLs at the edge helps simplify Cloudflare firewall rules expressions that use URLs.

For more information about characters and more, see here.




New users on CF have this ON by default, but if you are an older user, turn it on.


If bots pass the JS challenge, you can further reduce the challenge expiration time here.





Block IPs here


If you are not sure about an IP (or ASN), then before blocking it, be sure to look at whois/IPinfo

and at some of the sites below (enter the IP in the search field).



And/or click on the metadata in the log (Logflare)




Monitor the situation here





Then you spend approx. 10 days watching the logs and checking the analytics every day,

and after that you will have created enough FW rules to repel most malicious bots.

Periodically make the necessary updates to your existing firewall and IP rules.



You can add an exclusion for the “Known Bots” operator to your firewall rules,

so that legitimate crawlers such as Googlebot aren’t blocked if they touch a URL that triggers some other FW rule.

The list of bots is maintained by CF and verified by IP and other validation methods,
so you have nothing extra to do other than turn it ON and allow it.
Cloudflare verified Bots list
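
If you prefer to keep the exclusion inside the expression itself, the verified-bot list is exposed (as far as I know) through the boolean field cf.client.bot, so a sketch combining it with a couple of path checks from Rule 2 would look roughly like this (an illustration, not my exact rule):

((http.request.uri.path contains "xmlrpc.php") or (http.request.uri.path contains "wp-login")) and not cf.client.bot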


Just in case, move the rule to the first position (grab and move)




All this above is the starting point

I intentionally did not share my complete configuration
because I don’t want to jeopardize my site by publicly showing someone how to circumvent all my rules

But one of the more important activities is to block hosting providers (WAF/tools);
most attacks/scans will come from there.

Digitalocean = AS14061
Amazon = AS14618 , AS16509
Microsoft = AS8075
OVH = AS16276

(and hundreds more on the list)
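
Instead of adding one IP Access rule per network, a single FW rule can cover them. A sketch using the ASN field, with just the four networks listed above (the field I know of is ip.geoip.asnum; check the name in your dashboard, and extend the set with whatever ASNs you actually see in your logs):

(ip.geoip.asnum in {14061 14618 16509 8075 16276})

Set the action to Block, or to Managed Challenge if you want humans behind those networks to still be able to get through.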

Of course, some online sites (tools) hosted on those servers will no longer work because your site will deny them access,
so if necessary you can temporarily turn off a block and, when you finish the desired (online) “test”, re-enable it.



Useful - to understand the order

Useful links for better understanding




In addition to everything mentioned at the beginning of the topic,

it is preferable to allow only Cloudflare servers to access the origin.

Note: it is recommended to put that part of the code at the top of your .htaccess file.
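
For the origin side, on Apache 2.4 the .htaccess part could look roughly like the sketch below. The ranges shown are only a partial example; the full and current list is published at https://www.cloudflare.com/ips/ and it does change over time, so build your block from that page rather than copying this as-is.

# Allow only Cloudflare's published ranges to reach the origin (partial example)
<RequireAny>
    Require ip 173.245.48.0/20
    Require ip 103.21.244.0/22
    Require ip 141.101.64.0/18
    Require ip 108.162.192.0/18
    Require ip 162.158.0.0/15
    Require ip 104.16.0.0/13
    Require ip 172.64.0.0/13
    # ...add the remaining ranges from cloudflare.com/ips
</RequireAny>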

CF side:







At some point, if attacks go beyond normal complexity,
you need to add rate limiting, session inspection, and fingerprinting; going beyond that is most of the time absurd.
Relying on CAPTCHAs or JS challenges against complex attacks is foolish, and it has always been that way:
everybody knows that attackers can solve those two challenges; these layers exist to add cost and complexity to attacks.
When bot operators realize that their bots are being blocked, many will simply stop trying and look for another victim instead (hopefully not you).



Good luck and thank you for your time :slightly_smiling_face:

21 Likes

Wow, very comprehensive! Thanks for making this guide! :grinning:

5 Likes

Thanks!

It is a process that never stops :slight_smile:

I wrote it mostly to show users how to see logs.

It is much easier if you have the information you get from the logs:
with it you can profile attackers, get a bigger picture,
and at the same time be shocked by the fact that a large part of your site’s visitors is not human at all :joy:



Then it’s up to us to divide bots (depending on their behavior) into good and bad bots.


Good bots are those that first visit the robots.txt file and follow the rules in it,
and, for example, make 1 request every 5 seconds
and after e.g. 10 requests wait a few minutes (so as not to overload your site).

Good bots are part of some search engine that makes sense for your site.


Bad bots are those that don’t care and request the same thing all day,
even though the server properly returns 404.

Bad bots are also those that are part of some vulnerability scanning / pen testing programs (mostly downloaded from git) and then run mostly on DigitalOcean, Google, or Microsoft servers,
and crawlers for which it is not known why they collect information.

  • Web scraping - Hackers can steal web content by crawling websites and copying their entire contents. Fake or fraudulent sites can use the stolen content to appear legitimate and trick visitors.

  • Data harvesting - Aside from stealing entire websites’ content, bots are also used to harvest specific data such as personal, financial, and contact information that can be found online.

  • Price scraping - Product prices can also be scraped from ecommerce websites so that they can be used by companies to undercut their competitors.

  • Brute-force logins and credential stuffing - Malicious bots interact with pages containing log-in forms and attempt to gain access to sites by trying out different username and password combinations.

  • Digital ad fraud - Pay-per-click (PPC) advertising systems are abused by using bots to “click” on ads on a page. Unscrupulous site owners can earn from these fraudulent clicks.

  • Spam - Bots can also automatically interact with forms and buttons on websites and social media pages to leave phony comments or false product reviews.

  • Distributed denial-of-service attacks - Malicious bots can be used to overwhelm a network or server with immense amounts of traffic. Once the allotted resources are used, sites and applications supported or hosted by the network will become inaccessible.





Here’s a little help working with logs (search)


click on metadata = more info

m is short for metadata


Listing status 200: m.response.status_code:200 c:count(*) c:group_by(t::minute)


Listing 5xx errors: m.response.status_code:>499 c:count(*) c:group_by(t::minute)

Or list everything above 4xx;
with that you can find/isolate bots that dig through your website
while the server responds with e.g. a 404 error because the request does not exist.
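
Following the same query pattern as above, a listing of everything at 400 and above would presumably be:

m.response.status_code:>399 c:count(*) c:group_by(t::minute)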

Calendar (yesterday’s log, etc.)

Hovering the mouse over the graph gives this:

Click on the chart (e.g. the red part) and it automatically takes you there
and lists all the 5xx errors.

You can also see when the Cloudflare bot performs a health check (and if something is wrong, CF notifies me by e-mail).

6 Likes

might just want to decline user agent contains python

Some online tools use python-requests; for example,
you made a TXT record on CF and want to check whether it’s visible and valid,
and such online sites (tools) are often hosted on Amazon and have that user agent string.

So you should definitely pay attention
and customize the blocking code depending on what is happening on your domain.
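
If, unlike on this domain, you do want to cover generic Python clients in one go, the broad version is simply the sketch below; just remember the trade-off described above: legitimate online checkers will be caught too, so you may need to temporarily disable the rule while you use one of them.

(lower(http.user_agent) contains "python")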

4 Likes

also, the “MauiBot and MJ12bot” filters are blocking googlebot

That’s not true - check whether you have accidentally blocked a Google IP.

here is an image of how they visit regularly

3 Likes

i haven’t

You did not specify how you know Google is blocked;
maybe robots.txt is to blame,

or you do not have a valid TLS certificate,
or something else in the CF settings.

You can check here https://search.google.com/test/rich-results

and then look at the log… you should see Googlebot there

2 Likes

its getting blocked in cf

I made an update to the article here,
in case it helps you and others.

The only thing I saw in your case was that a simple visit from Google to the domain was already blocked, and it seems to me that you have some protection plugin (software) which is probably the culprit for the block,
and not the trigger you mentioned in your post.

1 Like

Well done and thanks for sharing.

3 Likes

Thank you so much for this useful information.

3 Likes

i wanted to try this so i installed the app but its way too complicated, i cant even find the edit button to generate a link :frowning:

1 Like

++ the Cloudflare interface has changed.



  1. Click your “source”

  2. Find EDIT at the top right
4 Likes

Here’s how to prevent this: block ALL bots (except trusted bots)!

This way you won’t have an insane 20 GB of traffic (maybe more, maybe less).

3 Likes

thanks everyone :smiley:

2 Likes

I forgot to mention one important action,
so I added it to the article today, along with a few more sentences around it.

2 Likes

Yes, normalization is important.

Heck, I should always normalize URLs before handling them (I have always had this problem).

Sometimes, redundancy really screws up our security XD

2 Likes