亚洲国产日韩欧美一区二区三区,精品亚洲国产成人av在线,国产99视频精品免视看7,99国产精品久久久久久久成人热,欧美日韩亚洲国产综合乱

java - 面試問(wèn)題:查找出現(xiàn)次數(shù)最多的ip
巴扎黑
巴扎黑 2017-04-18 10:31:54
0
9
768

就是 現(xiàn)在有一個(gè)log 里面全都是訪問(wèn)者的ip(數(shù)據(jù)很大很大) 一行一個(gè)ip
比如
123.22.22.187
123.22.22.182
123.22.21.187
123.22.42.187
123.23.22.187
123.22.22.182
121.22.22.182
....

怎么查出來(lái)訪問(wèn)次數(shù)最多的10個(gè)IP?

數(shù)據(jù)量非常大 不可能直接讀進(jìn)來(lái) 然后排序。。。

方式:
第一步:
如下讀?。?br>聲明一個(gè)變量 dictionary <string ,long> record
123.22.22.187
123.22.22.182
123.22.21.187
123.22.42.187
123.23.22.187
123.22.22.182
121.22.22.182
先與藍(lán)色部分為key ,key 出現(xiàn)的次數(shù)為value 存在 record中。
這樣 ,record的最大個(gè)數(shù)是255個(gè),不算多,更說(shuō)不上要多少內(nèi)存。
再來(lái)一次大掃除,
使record 剩下 最多的 10 個(gè)元素
以這10 個(gè)元素為條件,再掃描一次表,把符合條件的數(shù)據(jù)寫(xiě)到另一個(gè)[文件A]中,

第二步:
掃描 [文件A] 以下面數(shù)據(jù)藍(lán)色部分為key ,以出現(xiàn)次數(shù)為value 寫(xiě)到 變量 dictionary <string ,long> record 中
123.22.22.187
123.22.22.182
123.22.21.187
123.22.42.187
123.23.22.187
123.22.22.182
121.22.22.182
。。。
第三步:
。。。
第四步:
。。。
就這樣,
用4 步相同的方法,提出最多的數(shù)據(jù):10條,

問(wèn)題,上面這種方式感覺(jué)不行啊,是我沒(méi)理解嗎??
比如說(shuō):
123,125,196,158,假設(shè)這個(gè)ip地址出現(xiàn)的次數(shù)最多,那么如下,
121,123
121,14
121,56
123,125,196,158,
123,125,196,158,

如果分段比較次數(shù),那么123,125,196,158, 這個(gè)ip在第一步就被淘汰了?

巴扎黑
巴扎黑

reply all(9)
伊謝爾倫

The correct answer is this: sort|uniq -c

大家講道理

a.b.c.d

Deal with a first.

All files are divided into multiple reads (such as 255 times).

The records are then stored in 255 files, corresponding to a (such as 1.b.c.d; 2.b.c.d;)

This way you can get the largest ten.

Next, deal with b.

. . .

Finally got the answer.

伊謝爾倫

This kind of problem can use max-min heap.

巴扎黑

Basic idea

  1. 對(duì)所有的ip進(jìn)行分組,統(tǒng)計(jì)出現(xiàn)次數(shù),ip做key,出現(xiàn)次數(shù)做val,保存在關(guān)聯(lián)數(shù)組中

  2. 數(shù)組按照鍵值(出現(xiàn)次數(shù))降序排序,取出前 10 條記錄

Although it’s not java語(yǔ)言描述的,但javascript描述, I should be able to understand it. Anyway, I have tested it and it can achieve your effect:

The following is the complete example code:

    var data = [123.22.22.187 , 123.22.22.187 ....];  // 等價(jià)于日志中每行ip
    var data = getCount(data); // 對(duì)所有的ip進(jìn)行分組,統(tǒng)計(jì)出現(xiàn)次數(shù)
                               // ip做key,出現(xiàn)次數(shù)做val,保存在關(guān)聯(lián)數(shù)組中
    var rel  = getRecord(data , 10 , 'desc'); // 這個(gè)就是你想的結(jié)果了
    
    /*
     * ip 分組并統(tǒng)計(jì)次數(shù)的具體實(shí)現(xiàn)
     * @param Array data  日志文件中存放的那些所有ip組成的數(shù)組
     * @return Object
     */
    function getCount(data){
        var rel= {}; // 根據(jù)ip進(jìn)行分組后的結(jié)果,由于js沒(méi)有關(guān)聯(lián)數(shù)組
                      // 所以用對(duì)象替換
        var i;
        var count;
        var n;
        var cur;
        var curC;
        // 判斷是否已經(jīng)統(tǒng)計(jì)過(guò)了的
        var checkIsCount = function(ip){
            var k;
            
            for (k in rel)
                {
                    if (k === ip) {
                        return true;
                    }
                }
            
            return false;
        };
        
        // 每次固定一個(gè) ip
        // 先判斷當(dāng)前 ip 在 rel 中是否已經(jīng)存在(排除已被統(tǒng)計(jì)的ip,避免重復(fù)統(tǒng)計(jì))
        // 不存在的時(shí)候,開(kāi)始統(tǒng)計(jì)其出現(xiàn)次數(shù)
        for (i = 0; i < data.length; ++i)
            {
                cur = data[i];
                count = 1;
                
                if (!checkIsCount(cur)) {
                    for (n = i + 1; n < data.length; ++n)
                        {
                            curC = data[n];
    
                            if (cur === curC) {
                                count++;
                            }    
                        }
    
                     rel[cur] = count;      
                }
            }
            
       // 保存了所有不同ip,及其出現(xiàn)次數(shù)的對(duì)象
       console.log('ip進(jìn)行了分組并統(tǒng)計(jì)后的結(jié)果:' , rel);
       return rel;
    }
    
    /*
     * 數(shù)組按照鍵值(出現(xiàn)次數(shù))降序排序,取出前 len 條記錄
     * @param Object data 分組并統(tǒng)計(jì)出現(xiàn)次數(shù)后的對(duì)象
     * @param Number len  取出多少條
     * @param String sortType 升序還是降序(asc | desc)
     * @return Array
     */
    function getRecord(data , len , sortType){
        var rel = {};
        var result = [];
        var k;
        var k1;
        var curKey;
        var count = 0;
        // 檢查 ip 是否已被獲取
        var checkSorted = function(ip){
            var k;
            
            for (k in rel)
                {
                    if (k === ip) {
                        return true;
                    }
                }
                
            return false;
        };
    
        // 排序算法:選擇排序
        for (k in data)
            {
                
               if (count >= len) {
                   break;
               }
               
               curKey = k;
               
               if (!checkSorted(k)) {
                   for (k1 in data) 
                       {
                           if (!checkSorted(k1)) {
                              if (sortType === 'asc' ? data[curKey] > data[k1] : data[curKey] < data[k1]) {
                                  curKey = k1;
                              } 
                           }
                       }
                   
                   
                   // 按照鍵值排序后的最大值|最小值
                   rel[curKey] = data[curKey];
    
                   // 返回可視化結(jié)果
                   result.push(curKey);
                   
                   // 獲取的長(zhǎng)度增加
                   count++
               }
            }
    
        // 按照鍵值排序后的 len 長(zhǎng)度的記錄
        console.log('取出指定長(zhǎng)度記錄的實(shí)際結(jié)果:' , rel);
        console.log('取出指定長(zhǎng)度記錄的可視化結(jié)果:' , result);
        return result;
    }
黃舟

map-reduce or sort | uniq -c
Obviously use hadoop (map-reduce)?

PHPzhong

Sort it first, it will be easier to deal with after sorting.

Ty80

This question should be broken into two parts to understand,
1. Reading a large file at once is inefficient
2. Find the top 10 records with the most occurrences
Question 1: For reading large files, you can use java AIO Implementation method reads files asynchronously
Question 2: LindedList needs to be extended. Each time a record is appended, first check whether a record exists, and if so, push the record position forward (this is more similar to the java memcached caching algorithm)

巴扎黑

These types of problems all have the same idea, "divide and conquer".
Each IP is equivalent to a 32-bit integer number. Traverse all the IPs in the log, divide them into n files according to the value range and save them (make sure each file is small enough). In this case, the same IP will definitely be assigned to the same file. In a file; then process these n files one by one, count the top 10 IPs that appear in each file; finally get the top 10 from these 10*n IPs.

左手右手慢動(dòng)作

Use regular expressions to determine the first number at the beginning of the IP:

1.2.3\n
4.5.6\n
1.2.3\n

The number at the beginning of the line that appears the most is the most frequent IP

Latest Downloads
More>
Web Effects
Website Source Code
Website Materials
Front End Template