How to remove duplicate lines in a large text file?

2022-04-03 00:09:07

How would you remove duplicate lines from a file that is much too large to fit in memory? The duplicate lines are not necessarily adjacent, and say the file is 10 times bigger than RAM.

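A straightforward solution is to store each line of input.txt in a HashSet. Because a set rejects duplicate values, HashSet.add returns false when the line is already present, so a line is written to output.txt only the first time it is seen. Note that the set keeps every distinct line in memory, so this approach works only when the distinct lines fit in RAM, even if the file as a whole does not.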

Java:

// Java program to remove duplicate lines from
// input.txt and save the result to output.txt
import java.io.*;
import java.util.HashSet;

public class FileOperation
{
    public static void main(String[] args) throws IOException
    {
        // set that stores the unique lines seen so far
        HashSet<String> hs = new HashSet<String>();

        // try-with-resources flushes and closes both files, even on error
        try (BufferedReader br = new BufferedReader(new FileReader("input.txt"));
             PrintWriter pw = new PrintWriter("output.txt"))
        {
            String line;
            // loop over each line of input.txt
            while ((line = br.readLine()) != null)
            {
                // add() returns true only if the line
                // was not already present in the set
                if (hs.add(line))
                    pw.println(line);
            }
        }
        System.out.println("File operation performed successfully");
    }
}
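
If the distinct lines themselves do not fit in memory, one possible workaround is to partition the file on disk first. The following is only a minimal sketch and is not part of the original solution: it assumes UTF-8 text, that each bucket's distinct lines fit in RAM, and that the output order does not need to match the input order. The class name PartitionedDedup, the method dedup, and the bucket count are hypothetical names chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper, not from the original article.
class PartitionedDedup {

    // Pass 1 splits the input into `buckets` temporary files by line hash.
    // Pass 2 deduplicates each bucket with an in-memory HashSet.
    // Assumption: the distinct lines of any single bucket fit in memory.
    static void dedup(Path input, Path output, int buckets) throws IOException {
        Path tmpDir = Files.createTempDirectory("dedup");
        Path[] parts = new Path[buckets];
        BufferedWriter[] writers = new BufferedWriter[buckets];
        try {
            for (int i = 0; i < buckets; i++) {
                parts[i] = tmpDir.resolve("part-" + i + ".txt");
                writers[i] = Files.newBufferedWriter(parts[i], StandardCharsets.UTF_8);
            }
            // Identical lines have identical hash codes, so all copies
            // of a line end up in the same bucket file.
            try (BufferedReader br = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                String line;
                while ((line = br.readLine()) != null) {
                    int b = Math.floorMod(line.hashCode(), buckets);
                    writers[b].write(line);
                    writers[b].newLine();
                }
            }
        } finally {
            for (BufferedWriter w : writers) {
                if (w != null) {
                    w.close();
                }
            }
        }

        // Each bucket is deduplicated independently; the set is
        // discarded before the next bucket is read.
        try (BufferedWriter out = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            for (Path part : parts) {
                Set<String> seen = new HashSet<>();
                try (BufferedReader br = Files.newBufferedReader(part, StandardCharsets.UTF_8)) {
                    String line;
                    while ((line = br.readLine()) != null) {
                        if (seen.add(line)) {
                            out.write(line);
                            out.newLine();
                        }
                    }
                }
                Files.delete(part);
            }
        }
        Files.delete(tmpDir);
    }
}
```

It could be called as PartitionedDedup.dedup(Paths.get("input.txt"), Paths.get("output.txt"), 64). For a file about 10 times larger than RAM, the bucket count should be comfortably above 10 so that each bucket stays well below the available memory; 64 is an arbitrary choice here. Lines come out grouped by bucket rather than in their original order.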
