In the right use case, bloom filters look like magic. That's a bold statement, but in this tutorial we'll explore this strange data structure, how to best use it, and some practical examples using Redis and Node.js.
The Bloom filter is a probabilistic, one-way data structure. The word "filter" can be confusing in this context; it suggests an active thing, a verb, but it may be easier to think of it as storage, a noun. With a simple Bloom filter you can do two things:
- Add an item.
- Check whether an item has possibly been added before.
These are important limitations to understand: you cannot delete items, you cannot list the items in a Bloom filter, and you cannot determine with certainty whether an item has been added in the past. This is where the probabilistic nature of Bloom filters comes into play: false positives are possible, but false negatives are not. If the filter is sized correctly, the chance of a false positive can be made very small.
Variants of Bloom filters exist that add capabilities such as removal or scaling, but they also add complexity and limitations. It is important to understand the simple Bloom filter first, so this article covers only simple Bloom filters.
In exchange for these limits, you get several benefits: fixed size, hash-based obscurity, and fast lookups.
When you set up a Bloom filter, you must specify a size for it. This size is fixed: whether the filter holds one item or one billion items, it will never grow beyond the specified size. As you add more items, the likelihood of false positives increases; a smaller filter's false positive rate climbs faster than a larger one's.
Bloom filters are built on the concept of one-way hashing. Much like correctly stored passwords, Bloom filters use a hashing algorithm to derive an identifier for each item passed in. A hash is effectively irreversible and is represented by a seemingly random string of characters. So if someone gains access to a Bloom filter, it does not directly reveal anything about its contents.
Finally, Bloom filters are fast. Checking the filter involves far fewer operations than other approaches, and since it sits comfortably in memory, it can spare you performance-impacting trips to the database.
Now that you understand the limitations and advantages of Bloom filters, let's look at some situations where they can be used.
Setup
We will illustrate Bloom filters using Redis and Node.js. Redis is a good storage medium for Bloom filters: it's fast, it's in-memory, and it has a couple of specific commands (GETBIT, SETBIT) that make the implementation efficient. I assume you have Node.js, npm, and Redis installed on your system. Your Redis server should be running on the default port on localhost for the examples to work properly.
In this tutorial, we will not implement a filter from scratch. Instead, we'll focus on a practical use of a pre-built module on npm: bloom-redis. bloom-redis has a concise set of methods: add, contains, and clear.
As mentioned before, Bloom filters require a hashing algorithm to generate an item's identifier. bloom-redis uses the well-known MD5 algorithm, which works fine even though it is not ideal for Bloom filters (a bit slow, a bit of overkill).
Unique usernames
Usernames, especially those that identify a user in a URL, need to be unique. If you build an application that allows users to change their username, you probably want usernames that have never been used before, to avoid confusion and username-squatting attacks.
Without a Bloom filter, you would need to reference a table containing every username ever used, which can be prohibitively expensive at scale. Bloom filters let you add an item each time a user claims a new name. When a user checks whether a username is taken, all you need to do is check the Bloom filter. It can tell you with absolute certainty whether the requested username has never been added before. The filter may incorrectly report that a username has been taken when in fact it has not, but this errs on the side of caution and causes no real harm (other than that a user may not be able to claim "k3w1d00d47").
To illustrate this, let's build a quick REST server with Express. First, create a package.json file, then run the following terminal commands:
npm install bloom-redis --save
npm install express --save
npm install redis --save
The default size option for bloom-redis is 2 MB. That errs on the side of caution, but it's quite large. Setting the size of your Bloom filter is critical: too large and you waste memory, too small and the false positive rate climbs too high. The math involved in determining the size is involved and beyond the scope of this tutorial, but luckily there are Bloom filter size calculators that do the job without your having to crack open a textbook.
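Those calculators implement the standard sizing formulas, which are easy to check yourself. The sketch below computes the number of bits m and hash functions k for n items at false positive probability p; with n = 100,000 and p = 1.0E-6 it reproduces the size (2875518) and numHashes (20) used in the example server.

```javascript
// Standard Bloom filter sizing formulas (what the online calculators compute):
//   m = ceil( -n * ln(p) / (ln 2)^2 )   bits in the filter
//   k = round( (m / n) * ln 2 )         number of hash functions
function bloomSize(n, p) {
  const m = Math.ceil((-n * Math.log(p)) / (Math.LN2 ** 2));
  const k = Math.round((m / n) * Math.LN2);
  return { bits: m, hashes: k };
}

// 100,000 usernames at a one-in-a-million false positive rate:
console.log(bloomSize(100000, 1e-6)); // { bits: 2875518, hashes: 20 }
```

Note the trade-off is entirely between n and p: halving p (stricter accuracy) costs only a modest number of extra bits, while doubling n doubles the filter size.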
Now, create app.js as follows:
var Bloom = require('bloom-redis'),
    express = require('express'),
    redis = require('redis'),
    app,
    client,
    filter;

//setup our Express server
app = express();

//create the connection to Redis
client = redis.createClient();

filter = new Bloom.BloomFilter({
  client    : client, //make sure the Bloom module uses our newly created connection to Redis
  key       : 'username-bloom-filter', //the Redis key

  //calculated size of the Bloom filter.
  //This is where your size / probability trade-offs are made
  //http://hur.st/bloomfilter?n=100000&p=1.0E-6
  size      : 2875518, // ~350kb
  numHashes : 20
});

app.get('/check', function(req, res, next) {
  //check to make sure the query string has 'username'
  if (typeof req.query.username === 'undefined') {
    //skip this route, go to the next one - will result in a 404 / not found
    next('route');
  } else {
    filter.contains(
      req.query.username, // the username from the query string
      function(err, result) {
        if (err) {
          next(err); //if an error is encountered, send it to the client
        } else {
          res.send({
            username : req.query.username,
            //if the result is false, then we know the item has *not* been used
            //if the result is true, then we can assume that the item has been used
            status   : result ? 'used' : 'free'
          });
        }
      }
    );
  }
});

app.get('/save', function(req, res, next) {
  if (typeof req.query.username === 'undefined') {
    next('route');
  } else {
    //first, we need to make sure that it's not yet in the filter
    filter.contains(req.query.username, function(err, result) {
      if (err) {
        next(err);
      } else {
        if (result) {
          //true result means it already exists, so tell the user
          res.send({
            username : req.query.username,
            status   : 'not-created'
          });
        } else {
          //we'll add the username passed in the query string to the filter
          filter.add(
            req.query.username,
            function(err) {
              //the callback arguments to `add` provide no useful information,
              //so we'll just check to make sure that no error was passed
              if (err) {
                next(err);
              } else {
                res.send({
                  username : req.query.username,
                  status   : 'created'
                });
              }
            }
          );
        }
      }
    });
  }
});

app.listen(8010);
To run this server: node app.js. Point your browser at http://localhost:8010/check?username=kyle. The response should be: {"username":"kyle","status":"free"}.
Now, let's save that username by pointing your browser at http://localhost:8010/save?username=kyle. The response will be: {"username":"kyle","status":"created"}. If you return to http://localhost:8010/check?username=kyle, the response will be {"username":"kyle","status":"used"}. Similarly, returning to http://localhost:8010/save?username=kyle will result in {"username":"kyle","status":"not-created"}.
From the terminal, you can see the size of your filter with: redis-cli strlen username-bloom-filter. Right now, with one item, it should read 338622.
Now, go ahead and try adding more usernames with the /save route. You can try as many as you want.
If you check the size again, you may notice it has grown slightly, but not with every addition. Curious, right? Internally, the Bloom filter sets individual bits (1s and 0s) at different positions in the string stored at username-bloom-filter. These positions are not contiguous, though, so if you set a bit at index 0 and then a bit at index 10,000, everything in between remains 0. For practical purposes, it isn't important to understand the precise mechanics of every operation at first; just know that this behavior is normal and that your storage in Redis will never exceed the size you specified.
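You can see this sparse-growth behavior without touching Redis. The sketch below mimics SETBIT's growth semantics with a Node Buffer (an approximation for illustration, not Redis's actual internals): the byte string only grows far enough to hold the highest bit offset ever set, and everything in between stays zero.

```javascript
// Mimic Redis SETBIT growth semantics: the string grows only to the byte
// containing the highest offset set so far, zero-padded in between.
function setbit(buf, offset, value) {
  const byteLen = (offset >> 3) + 1;
  if (buf.length < byteLen) {
    buf = Buffer.concat([buf, Buffer.alloc(byteLen - buf.length)]); // grow, zero-filled
  }
  if (value) buf[offset >> 3] |= 0x80 >> (offset & 7); // Redis numbers bits MSB-first
  return buf;
}

let bits = Buffer.alloc(0);
bits = setbit(bits, 0, 1);     // string is now 1 byte long
bits = setbit(bits, 10000, 1); // jumps to 1251 bytes -- everything between is zero
console.log(bits.length); // 1251
```

This is why strlen jumps in uneven steps: an addition only enlarges the string when one of its hashed bit positions lands beyond the current highest byte.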
Fresh content
Fresh content keeps users coming back to a website, so how do you show a user something new on every visit? Using a traditional database approach, you would add a new row to a table containing the user identifier and the story identifier, then query that table when deciding what to display. As you might imagine, that table will grow very quickly, especially as your users and content grow.
In this case, the consequence of a false positive (i.e., occasionally not showing a piece of unseen content) is very small, which makes Bloom filters a viable option. At first glance, you might think that each user needs their own Bloom filter, but instead we'll concatenate a user identifier with a content identifier and insert that combined string into the filter. This way we can use a single filter for all users.
In this example, let's build another basic Express server that displays content. Each time you visit the route /show-content/any-username (where any-username is any URL-safe value), a new piece of content will be displayed, until the site runs out of content for that user. In the example, the content is the opening line of each of the top ten Project Gutenberg books.
We need to install another npm module. Run from terminal:
npm install async --save
Your new app.js file starts like this:
var async = require('async'),
    Bloom = require('bloom-redis'),
    express = require('express'),
    redis = require('redis'),
    app,
    client,
    filter,
    // From Project Gutenberg - opening lines of the top 10 public domain ebooks
    // https://www.gutenberg.org/browse/scores/top
    openingLines = {
      'pride-and-prejudice' :
        'It is a truth universally acknowledged, that a single man in possession of a good fortune, must be in want of a wife.',
      'alices-adventures-in-wonderland' :
        'Alice was beginning to get very tired of sitting by her sister on the bank, and of having nothing to do: once or twice she had peeped into the book her sister was reading, but it had no pictures or conversations in it'
    };
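The heart of this example is the concatenation scheme described above. Here is a minimal sketch of the selection logic, with a plain JavaScript Set standing in for the Bloom filter; in the real server, bloom-redis's contains and add would replace the Set operations (driven through async callbacks), and nextUnseen is a hypothetical helper invented here for illustration.

```javascript
// Find the first piece of content this user has not seen, then mark it seen.
// `seen` stands in for the Bloom filter; `contentIds` are the openingLines keys.
function nextUnseen(username, contentIds, seen) {
  for (const id of contentIds) {
    const key = username + ':' + id; // one filter serves all users via concatenation
    if (!seen.has(key)) {
      seen.add(key); // record that this user has now seen this item
      return id;
    }
  }
  return null; // the user has seen everything
}

const ids = ['pride-and-prejudice', 'alices-adventures-in-wonderland'];
const seen = new Set();
console.log(nextUnseen('kyle', ids, seen)); // 'pride-and-prejudice'
console.log(nextUnseen('kyle', ids, seen)); // 'alices-adventures-in-wonderland'
console.log(nextUnseen('kyle', ids, seen)); // null
```

Because the keys are namespaced per user, a second user starts from the beginning of the list with the very same filter.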
If you pay careful attention to the round-trip times in your development tools, you will notice that the more times you request the path with the same username, the longer it takes. Checking the filter takes a fixed amount of time per item, but here we are checking for the presence of more and more items: Bloom filters are limited in what they can tell you, so you have to test the presence of each item in turn. In our example that is fairly trivial, but testing hundreds of items would be inefficient.
Stale data
In this example, we will build a small Express server that does two things: accepts new data via POST, and displays the current data (on a GET request). When new data is POSTed to the server, the application checks whether it already exists in the filter. If it doesn't exist, we add it to a set in Redis; otherwise we return null. A GET request fetches it from Redis and sends it to the client.
Unlike the first two situations, false positives cannot be tolerated here. Instead, we will use the Bloom filter as a first line of defense. Given the properties of Bloom filters, we can only be sure that something is not in the filter, in which case we can safely let the data through. When the Bloom filter reports that the data might be in the filter, we check against the actual data source.
So, what do we gain? We gain the speed of not having to check the actual source every time. In situations where the data source is slow (an external API, a pokey database, the middle of a flat file), that speed boost is really needed. To demonstrate it, we'll add a realistic 150 ms delay to the example. We'll also use console.time / console.timeEnd to log the difference between a Bloom filter check and a non-Bloom-filter check.
In this example, we'll also use an absurdly limited number of bits: just 1024. It will fill up quickly. As it fills, it will return more and more false positives, and you'll see the response times increase as the false positive rate climbs.
This server uses the same modules as before, so set up your app.js file like this:
var async = require('async'),
    Bloom = require('bloom-redis'),
    bodyParser = require('body-parser'),
    express = require('express'),
    redis = require('redis'),
    app,
    client,
    filter,
    currentDataKey = 'current-data',
    usedDataKey = 'used-data';

app = express();
client = redis.createClient();

filter = new Bloom.BloomFilter({
  client : client,
  key    : 'stale-bloom-filter',
  //for illustration purposes, this is a super small filter. It should fill up
  //at around 500 items, so for a production load you'd need something much larger!
  size      : 1024,
  numHashes : 20
});

app.post('/', bodyParser.text(), function(req, res, next) {
  var used;

  console.log('POST -', req.body); //log the current data being posted
  console.time('post'); //start measuring the time it takes to complete our filter and conditional verification process

  //async.series is used to manage multiple asynchronous function calls
  async.series([
    function(cb) {
      filter.contains(req.body, function(err, filterStatus) {
        if (err) {
          cb(err);
        } else {
          used = filterStatus;
          cb(err);
        }
      });
    },
    function(cb) {
      if (used === false) {
        //Bloom filters have no false negatives, so we need no further verification
        cb(null);
      } else {
        //it *may* be in the filter, so we need to do a follow-up check
        //for the purposes of the tutorial, we'll add a 150ms delay here, since Redis
        //can be fast enough to make the difference hard to measure and the delay
        //will simulate a slow database or API call
        setTimeout(function() {
          console.log('possible false positive');
          client.sismember(usedDataKey, req.body, function(err, membership) {
            if (err) {
              cb(err);
            } else {
              //sismember returns 0 if a member is not part of the set and 1 if it is.
              //This transforms those results into booleans for consistent logic comparison
              used = membership === 0 ? false : true;
              cb(err);
            }
          });
        }, 150);
      }
    },
    function(cb) {
      if (used === false) {
        console.log('Adding to filter');
        filter.add(req.body, cb);
      } else {
        console.log('Skipped filter addition, [false] positive');
        cb(null);
      }
    },
    function(cb) {
      if (used === false) {
        client.multi()
          .set(currentDataKey, req.body) //fresh data is set for easy access at the 'current-data' key
          .sadd(usedDataKey, req.body)   //and added to a set for easy verification later
          .exec(cb);
      } else {
        cb(null);
      }
    }
  ], function(err) {
    if (err) {
      next(err);
    } else {
      console.timeEnd('post'); //logs the amount of time since the console.time call above
      res.send({ saved : !used }); //whether the item was saved: true for fresh data, false for stale data
    }
  });
});

app.get('/', function(req, res, next) {
  //just return the fresh data
  client.get(currentDataKey, function(err, data) {
    if (err) {
      next(err);
    } else {
      res.send(data);
    }
  });
});

app.listen(8012);
Since POSTing to the server from a browser can be tricky, let's use curl to test it:
curl --data "your data goes here" --header "Content-Type: text/plain" http://localhost:8012/
A quick bash script can show what filling up the entire filter looks like:
#!/bin/bash
for i in `seq 1 500`;
do
  curl --data "data $i" --header "Content-Type: text/plain" http://localhost:8012/
done
Watching a filling (or full) filter is interesting. Since this one is so small, you can easily inspect it with redis-cli. By running redis-cli get stale-bloom-filter from the terminal between additions, you will see the individual bytes increase. A full filter will be \xff at every byte. At that point, the filter will always return positive.
Conclusion
Bloom filters are not a cure-all, but in the right situation, a Bloom filter can provide a fast, efficient complement to other data structures.
The above is the detailed content of Explore the power of Bloom Filters using Node.js and Redis.
