Remove duplicated data, how to make it faster [on hold]

Question:

I need to process a large array of long values (100 to 200 million elements) frequently. Each value in the array repeats a few times or thousands of times before the next value appears, and I need to remove the repeated data.
What I do now is take a value and compare it to the next, until the next value is different. It takes 1.5 s to process one million elements, so 100 million elements take on the order of 100 seconds. How can I do this faster?
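
For reference, a minimal sketch of the single-pass approach the question describes (the function name and the in-place compaction are assumptions, not the asker's actual code):

```cpp
#include <cstddef>

// Remove consecutive duplicates in place; returns the new logical length.
// Assumes repeats are always adjacent, as described in the question.
std::size_t dedup_consecutive(long* data, std::size_t n) {
    if (n == 0) return 0;
    std::size_t out = 1;                  // data[0] is always kept
    for (std::size_t i = 1; i < n; ++i) {
        if (data[i] != data[out - 1])     // new value: keep it
            data[out++] = data[i];
    }
    return out;
}
```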

Comments:

1  
Split the data into chunks and hand each chunk to its own thread to remove the duplicates within it; then, merge-sort style, stitch the chunks back together, removing any duplicates that straddle the chunk boundaries.
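
A sketch of what this might look like, assuming duplicates are strictly consecutive so only the chunk boundaries need fixing up when the pieces are joined (all names here are illustrative):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Deduplicate consecutive repeats in parallel: each thread compacts its
// own chunk with std::unique, then the chunks are concatenated, dropping
// any value that equals the last value already written.
std::vector<long> parallel_dedup(std::vector<long>& data, unsigned nthreads) {
    std::size_t n = data.size();
    std::size_t chunk = (n + nthreads - 1) / nthreads;
    std::vector<std::size_t> newlen(nthreads);
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(lo + chunk, n);
            if (lo >= hi) { newlen[t] = 0; return; }
            auto end = std::unique(data.begin() + lo, data.begin() + hi);
            newlen[t] = static_cast<std::size_t>(end - (data.begin() + lo));
        });
    }
    for (auto& w : workers) w.join();

    // Stitch the compacted chunks together, fixing up the boundaries.
    std::vector<long> out;
    out.reserve(n);
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t lo = t * chunk;
        for (std::size_t i = 0; i < newlen[t]; ++i) {
            long v = data[lo + i];
            if (out.empty() || v != out.back()) out.push_back(v);
        }
    }
    return out;
}
```

Since this pass is mostly memory-bandwidth-bound, the speedup may fall well short of the thread count.
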
1  
See if std::unique can do it faster than you can.
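
For what it's worth, std::unique performs exactly this operation: it removes consecutive duplicates in a single pass over the range. A minimal sketch on a plain array:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>

int main() {
    long data[] = {7, 7, 7, 3, 3, 9, 9, 9, 9, 3};
    std::size_t n = sizeof data / sizeof *data;

    // std::unique keeps the first element of each run of equal values and
    // returns an iterator one past the last kept element; elements beyond
    // it are left in a valid but unspecified state.
    long* new_end = std::unique(data, data + n);
    std::size_t new_n = static_cast<std::size_t>(new_end - data);

    std::printf("%zu unique values\n", new_n);  // prints "4 unique values"
}
```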
    
I would recommend using std::vector. Vectors are easily resizable, come with quite a few useful member functions, and you can index them like arrays whenever you feel like it.
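
If the data lives in a std::vector, the usual erase idiom trims the container after std::unique has compacted it; a small sketch:

```cpp
#include <algorithm>
#include <vector>

int main() {
    std::vector<long> v = {5, 5, 5, 2, 2, 8, 8, 5};

    // std::unique compacts one element per run to the front; erase then
    // trims the leftover tail so v.size() is the deduplicated length.
    v.erase(std::unique(v.begin(), v.end()), v.end());
    // v is now {5, 2, 8, 5}
}
```
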
1  
I believe a hash set would come in handy here, even with a single thread of execution.
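
One caveat: a hash set removes duplicates globally, not just consecutive runs, so it changes the result whenever a value reappears later. A sketch with std::unordered_set that keeps the first occurrence of each value:

```cpp
#include <unordered_set>
#include <vector>

// Keep only the first occurrence of each value anywhere in the input.
// Unlike std::unique, this also removes non-adjacent repeats.
std::vector<long> dedup_global(const std::vector<long>& in) {
    std::unordered_set<long> seen;
    std::vector<long> out;
    out.reserve(in.size());
    for (long v : in)
        if (seen.insert(v).second)   // insert() reports whether v was new
            out.push_back(v);
    return out;
}
```

For strictly consecutive duplicates, though, the std::unique approach needs no extra memory and will usually be faster.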
    
Do you mean an array of type long, or some other, bigger type? Only 1 million comparisons a second doesn't sound very efficient; you should be getting more than that. How much memory does the array with duplicates occupy? How much main memory do you have available on the machine? Are you storing the deduplicated data in the source array, or in a separate array? Are you coding in C or C++? You shouldn't dual-tag questions; the solution options in C++ are very different from those for C. Without any code on show, it is hard to know what you're doing. Provide an MCVE (Minimal, Complete, and Verifiable Example).

Original source:

https://stackoverflow.com/questions/47747298/remove-duplicated-data-how-to-make-it-faster
