Tests with The One Billion Row Challenge (1BRC)

Have you ever ventured into the realm of reading massive files with native Java code? How about tackling a challenge that involves dealing with a staggering 1,000,000,000 rows? That’s precisely what Gunnar Morling’s One Billion Row Challenge invites you to explore. The challenge remains open until January 31, 2024; see the official repository for more information.

Your mission, should you choose to accept it, is to craft a Java program capable of extracting temperature measurement values from a text file and computing the minimum, mean, and maximum temperatures per weather station. The catch? The file is a behemoth, boasting a whopping 1,000,000,000 rows!

The challenge repository has already garnered over 100 submissions, showcasing a myriad of ingenious approaches to achieve optimal performance. Strategies include dropping down to sun.misc.Unsafe, manipulating raw bytes, employing platform threads and Virtual Threads, and leveraging the Arena API to manage memory-segment lifecycles efficiently, with flexible allocation and timely deallocation.
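
To make that last point concrete, here is a minimal sketch (mine, not lifted from any submission) of how the Arena API can scope the lifetime of a memory-mapped file. It assumes a JDK where the Foreign Function & Memory API is available (preview in Java 21, final in Java 22) and a hypothetical measurements.txt file:

[java]
import java.io.IOException;
import java.lang.foreign.Arena;
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ArenaSketch {
    public static void main(String[] args) throws IOException {
        var path = Path.of("measurements.txt"); // hypothetical file name
        try (var arena = Arena.ofConfined();
             var channel = FileChannel.open(path, StandardOpenOption.READ)) {
            // Map the whole file as a MemorySegment whose lifetime is bound to the Arena.
            MemorySegment segment = channel.map(FileChannel.MapMode.READ_ONLY, 0, channel.size(), arena);
            // Read raw bytes directly from the mapping; real submissions chunk and parse from here.
            byte firstByte = segment.get(ValueLayout.JAVA_BYTE, 0);
            System.out.println("first byte: " + (char) firstByte);
        } // closing the Arena unmaps the segment deterministically
    }
}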

Over the past two days, I’ve dedicated my efforts to implementing a solution that meets the baseline of around 4 minutes (spoiler: the fastest I got was about 2 minutes). Along this journey, I’ve uncovered some valuable insights. For instance, when dealing with large files using native Java features (without Unsafe, etc.), the most effective method involves using Files.lines(Paths.get(FILE)). In my tests, processing that stream in parallel yielded even better results.

[java]
private static void run() throws IOException {
    ConcurrentHashMap<String, Result> data = new ConcurrentHashMap<>();

    // Processing time: 122,740 ms
    try (var lines = Files.lines(Paths.get(FILE)).parallel()) {
        lines.forEach(line -> {
            String[] split = line.split(";");
            data.computeIfAbsent(split[0], r -> new Result()).calculate(split[1]);
        });
    }

    System.out.println(new TreeMap<>(data));
}
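
For reference, Result is just a small accumulator. A simplified sketch of what such a class could look like (not necessarily the exact code in my repository) keeps min, max, sum, and count per station, with synchronization because several parallel workers can update the same station concurrently:

[java]
// Simplified sketch of a Result accumulator (not necessarily the exact class from the
// repository): it parses the reading and tracks min, max, sum, and count so the mean
// can be derived for the output. calculate is synchronized because multiple parallel
// stream workers may hit the same station at the same time.
static class Result {
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;
    private double sum;
    private long count;

    synchronized void calculate(String value) {
        double measurement = Double.parseDouble(value);
        min = Math.min(min, measurement);
        max = Math.max(max, measurement);
        sum += measurement;
        count++;
    }

    @Override
    public String toString() {
        return "%.1f/%.1f/%.1f".formatted(min, sum / count, max);
    }
}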

Attempts to use Files.readAllLines(Paths.get(FILE)), or to collect the lines into a list for parallel processing, quickly ran into trouble: readAllLines materializes every line in heap memory before any processing starts, which simply does not scale to a billion rows. The Stream API emerges as the recommended approach for efficiently processing datasets of this size.
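
For comparison, this is roughly what that dead end looks like (a sketch, and the method name is mine), reusing the same FILE, data, and Result as above; on a billion-row file it typically fails with an OutOfMemoryError before any parallel work begins:

[java]
// Sketch of the approach that does not scale: readAllLines loads the entire file
// into a List<String> in heap memory before any aggregation starts.
private static void runReadAllLines() throws IOException {
    ConcurrentHashMap<String, Result> data = new ConcurrentHashMap<>();

    List<String> lines = Files.readAllLines(Paths.get(FILE)); // likely OutOfMemoryError here
    lines.parallelStream().forEach(line -> {
        String[] split = line.split(";");
        data.computeIfAbsent(split[0], r -> new Result()).calculate(split[1]);
    });

    System.out.println(new TreeMap<>(data));
}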

I also experimented with Virtual Threads, but surprisingly, I observed no performance improvement. In fact, in my tests, Virtual Threads in conjunction with parallel processing fared worse than the Stream API parallel approach.

[java]
private static void run3() throws IOException {
    ConcurrentHashMap<String, Result> data = new ConcurrentHashMap<>();

    // Processing time: 194,468 ms
    try (var lines = Files.lines(Paths.get(FILE)).parallel()) {
        lines.forEach(line -> {
            Thread.startVirtualThread(() -> {
                String[] split = line.split(";");
                data.computeIfAbsent(split[0], r -> new Result()).calculate(split[1]);
            });
        });
    }
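    // Note: the virtual threads started above are never joined, so this print can
    // run before every line has been aggregated.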

    System.out.println(new TreeMap<>(data));
}

Interestingly, some of the swifter implementations employ Unsafe, chunking the file into byte ranges, reading them, and performing the calculations directly on bytes, as seen in Quan Anh Mai’s implementation. My attempts to adapt that code to Virtual Threads did not yield a significant improvement either. The change was from:

[java]
var thread = new Thread(() -> resultList[index] = processFile(data, offset, limit));

// to

threadList[index] = Thread.ofVirtual().start(() -> {
    resultList[index] = processFile(data, offset, limit);
});
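
Whichever way the threads are created, the per-chunk results still have to be waited on before merging; a hypothetical sketch of that step, assuming the threadList and resultList arrays above, is simply:

[java]
// Hypothetical join step: wait for every chunk before merging the partial results.
for (Thread thread : threadList) {
    thread.join(); // throws InterruptedException, handled by the caller
}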

Another change I made was switching from:

[java]
target.compute(key, (k, v) -> {
    if (v == null) {
        v = new Aggregator();
    }

    v.min = Math.min(v.min, UNSAFE.getShort(this.data, baseOffset + MIN_OFFSET));
    v.max = Math.max(v.max, UNSAFE.getShort(this.data, baseOffset + MAX_OFFSET));
    v.sum += UNSAFE.getLong(this.data, baseOffset + SUM_OFFSET);
    v.count += UNSAFE.getLong(this.data, baseOffset + COUNT_OFFSET);
    return v;
});

to

[java]
target.computeIfAbsent(key, k -> new Aggregator()).calculate(this.data, baseOffset);

But, again, I haven’t noticed any significant improvement from using computeIfAbsent instead of compute(…) with the if (v == null) check; that is perhaps not surprising, since both variants end up doing essentially the same work per key.

Of course, with my change the Aggregator now implements all the min, max, sum, and count calculations itself:

[java]
public void calculate(Object data, long baseOffset) {
    min = Math.min(min, UNSAFE.getShort(data, baseOffset + 4));
    max = Math.max(max, UNSAFE.getShort(data, baseOffset + 6));
    sum += UNSAFE.getLong(data, baseOffset + 8);
    count += UNSAFE.getLong(data, baseOffset + 16);
}

Anyway, the journey has been enlightening, and you can explore my implementation at this GitHub repository. While I’m unsure if I’ll submit a pull request given the abundance of impressive code, the experience has been enriching, and I’ve gleaned valuable insights along the way.
