Searching in a large CSV as fast as Guava Splitter
Since Java 8 was released, I've found that I don't need over 2 MB of Google Guava in my projects, as I can replace most of it with plain Java. However, I really liked Splitter's nice API, which was pretty fast at the same time. Most importantly, it splits lazily. It seems replaceable by Pattern.splitAsStream. So I prepared a quick test: find a value in the middle of a long string (i.e. splitting the whole string is unnecessary).
package splitstream;

import com.google.common.base.Splitter;
import org.junit.Assert;
import org.junit.Test;

import java.util.StringTokenizer;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class SplitStreamPerfTest {

    private static final int TIMES = 1000;
    private static final String FIND = "10000";

    @Test
    public void go() throws Exception {
        final String longString = IntStream.rangeClosed(1, 20000).boxed()
                .map(Object::toString)
                .collect(Collectors.joining(" ,"));
        IntStream.rangeClosed(1, 3).forEach((i) -> {
            measureTime("Test " + i + " with regex", () -> doWithRegex(longString));
            measureTime("Test " + i + " with string tokenizer", () -> doWithStringTokenizer(longString));
            measureTime("Test " + i + " with guava", () -> doWithGuava(longString));
        });
    }

    private void measureTime(String name, Runnable r) {
        long s = System.currentTimeMillis();
        r.run();
        long elapsed = System.currentTimeMillis() - s;
        System.out.println("Check " + name + " took " + elapsed + " ms");
    }

    private void doWithStringTokenizer(String longString) {
        String f = null;
        for (int i = 0; i < TIMES; i++) {
            StringTokenizer st = new StringTokenizer(longString, ",", false);
            while (st.hasMoreTokens()) {
                String t = st.nextToken().trim();
                if (FIND.equals(t)) {
                    f = t;
                    break;
                }
            }
        }
        Assert.assertEquals(FIND, f);
    }

    private void doWithRegex(String longString) {
        final Pattern pattern = Pattern.compile(",");
        String f = null;
        for (int i = 0; i < TIMES; i++) {
            f = pattern.splitAsStream(longString)
                    .map(String::trim)
                    .filter(FIND::equals)
                    .findFirst().orElse("");
        }
        Assert.assertEquals(FIND, f);
    }

    private void doWithGuava(String longString) {
        final Splitter splitter = Splitter.on(',').trimResults();
        String f = null;
        for (int i = 0; i < TIMES; i++) {
            Iterable<String> iterable = splitter.split(longString);
            for (String s : iterable) {
                if (FIND.equals(s)) {
                    f = s;
                    break;
                }
            }
        }
        Assert.assertEquals(FIND, f);
    }
}
Results (after warm-up)
Check Test 3 with regex took 1359 ms
Check Test 3 with string tokenizer took 750 ms
Check Test 3 with guava took 594 ms
How do I make a Java implementation as fast as Guava? Maybe I am doing it wrong?
Or maybe you know a tool / library as fast as Guava Splitter that doesn't involve pulling in a lot of unused classes just for that?
First of all, Guava is much more than just Splitter, Predicate and Function; you probably aren't using everything it has to offer. In any case, your benchmark is broken, possibly in multiple ways: hand-rolled timing loops like this make me shiver. I used JMH to test these methods just for fun:
import com.google.common.base.Splitter;
import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

import java.util.StringTokenizer;
import java.util.concurrent.TimeUnit;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

@BenchmarkMode(org.openjdk.jmh.annotations.Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 5, time = 2, timeUnit = TimeUnit.SECONDS)
@State(Scope.Thread)
public class GuavaTest {

    public static void main(String[] args) throws RunnerException {
        Options opt = new OptionsBuilder().include(GuavaTest.class.getSimpleName())
                .jvmArgs("-ea", "-Xms10g", "-Xmx10g")
                .shouldFailOnError(true)
                .build();
        new Runner(opt).run();
    }

    @Param(value = { "300", "1000" })
    public String tokenToSearchFor;

    @State(Scope.Benchmark)
    public static class ThreadState {
        String longString = IntStream.range(1, 20000).boxed()
                .map(Object::toString)
                .collect(Collectors.joining(" ,"));
        StringTokenizer st = null;
        Pattern pattern = null;
        Splitter splitter = null;

        @Setup(Level.Invocation)
        public void setUp() {
            st = new StringTokenizer(longString, ",", false);
            pattern = Pattern.compile(",");
            splitter = Splitter.on(',').trimResults();
        }
    }

    @Benchmark
    @Fork(1)
    public boolean doWithStringTokenizer(ThreadState ts) {
        while (ts.st.hasMoreTokens()) {
            String t = ts.st.nextToken().trim();
            if (t.equals(tokenToSearchFor)) {
                return true;
            }
        }
        return false;
    }

    @Benchmark
    @Fork(1)
    public boolean doWithRegex(ThreadState ts) {
        return ts.pattern.splitAsStream(ts.longString)
                .map(String::trim)
                .anyMatch(tokenToSearchFor::equals);
    }

    @Benchmark
    @Fork(1)
    public boolean doWithGuava(ThreadState ts) {
        Iterable<String> iterable = ts.splitter.split(ts.longString);
        for (String s : iterable) {
            if (s.equals(tokenToSearchFor)) {
                return true;
            }
        }
        return false;
    }
}
And the results:
Benchmark (tokenToSearchFor) Mode Cnt Score Error Units
GuavaTest.doWithGuava 300 avgt 5 19284.192 ± 23536.321 ns/op
GuavaTest.doWithGuava 1000 avgt 5 67182.531 ± 93242.266 ns/op
GuavaTest.doWithRegex 300 avgt 5 65780.954 ± 169044.641 ns/op
GuavaTest.doWithRegex 1000 avgt 5 182530.069 ± 409571.222 ns/op
GuavaTest.doWithStringTokenizer 300 avgt 5 34111.030 ± 61014.332 ns/op
GuavaTest.doWithStringTokenizer 1000 avgt 5 118963.048 ± 165510.183 ns/op
This makes Guava the fastest. If you add parallel() to the splitAsStream pipeline it gets even more interesting; a must-read here.
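For reference, a minimal sketch of what a parallel splitAsStream pipeline might look like (class and method names are my own; whether parallel() actually pays off for a short-circuiting search like anyMatch depends on input size and core count):

```java
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelSplitDemo {

    // splitAsStream is lazy; parallel() asks the spliterator to partition
    // the input, trading lazy short-circuiting for multi-core scanning.
    static boolean findParallel(String haystack, String needle) {
        return Pattern.compile(",").splitAsStream(haystack)
                .parallel()
                .map(String::trim)
                .anyMatch(needle::equals);
    }

    public static void main(String[] args) {
        String longString = IntStream.range(1, 20000).boxed()
                .map(Object::toString)
                .collect(Collectors.joining(" ,"));
        System.out.println(findParallel(longString, "10000")); // true
    }
}
```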
This might be useful: you can keep only the parts of Guava you actually need: https://github.com/google/guava/wiki/UsingProGuardWithGuava
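As a rough illustration only (the exact rules depend on your build; the wiki page above is the authoritative source), a ProGuard rules sketch that keeps Splitter and lets ProGuard strip the rest of Guava might look like:

```
# Hypothetical sketch: keep Splitter (and its strategy inner classes),
# suppress warnings about the rest of Guava, let ProGuard shrink it away.
-keep class com.google.common.base.Splitter { *; }
-keep class com.google.common.base.Splitter$* { *; }
-dontwarn com.google.common.**
```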
You could call pattern.split(text) and iterate over the resulting array in a plain loop; try it. It might be faster than a stream, although I'm not sure it will beat Guava. I mean this:
private void doWithRegexAndSplit(String longString) {
    final Pattern pattern = Pattern.compile(",");
    String f = null;
    for (int i = 0; i < TIMES; i++) {
        String[] arr = pattern.split(longString);
        for (int j = 0; j < arr.length; j++) {
            String t = arr[j].trim();
            if (FIND.equals(t)) {
                f = t;
                break;
            }
        }
    }
    Assert.assertEquals(FIND, f);
}
Please check the completion time for this case.
You are comparing Pattern.splitAsStream(CharSequence) against Splitter.split(CharSequence) built with Splitter.on(char), not with Splitter.onPattern(String). Matching a single char is a much simpler computation than matching a regex pattern.
If you use Splitter.onPattern(",").trimResults()
, you will get the following results:
Check Test 3 with regex took 608 ms
Check Test 3 with string tokenizer took 403 ms
Check Test 3 with guava took 306 ms
Check Test 3 with guava pattern took 689 ms
In this case Pattern.splitAsStream(CharSequence) actually performs better than the Guava implementation (if that is a valid measurement, which is always dubious since we are not using JMH).
I am not aware of any char-delimiter JDK solution similar to Guava's Splitter.on(char).split(CharSequence). You can roll your own, but Guava's implementation looks well optimized.
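If you do roll your own, a minimal sketch (my own class and method names) of a char-delimited search using indexOf, with no regex engine and no token objects allocated, could look like this:

```java
public class CharSplitSearch {

    // Scans between delimiter positions with indexOf instead of
    // materializing tokens; "trims" by shrinking the region boundaries
    // and compares in place with regionMatches.
    static boolean contains(String haystack, char delimiter, String needle) {
        int start = 0;
        while (true) {
            int end = haystack.indexOf(delimiter, start);
            int tokenEnd = (end == -1) ? haystack.length() : end;
            // manual trim of the [start, tokenEnd) region
            int s = start, e = tokenEnd;
            while (s < e && haystack.charAt(s) == ' ') s++;
            while (e > s && haystack.charAt(e - 1) == ' ') e--;
            if (e - s == needle.length()
                    && haystack.regionMatches(s, needle, 0, needle.length())) {
                return true;
            }
            if (end == -1) return false;
            start = end + 1;
        }
    }

    public static void main(String[] args) {
        System.out.println(contains("a , b , c", ',', "b")); // true
    }
}
```

Whether this beats Guava's Splitter is something you would still have to measure with JMH.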