Which is faster to delete the first line of a file... sed or tail?
In this answer (How can I remove the first line of a file with sed?) there are two ways to delete the first record in a file:
sed '1d' $file >> headerless.txt

---------------- OR ----------------

tail -n +2 $file >> headerless.txt

Personally I think the tail option is cosmetically more pleasing and more readable, but that's probably because I'm sed-challenged.
Which method is fastest?
6 Answers
Performance of sed vs. tail to remove the first line of a file
TL;DR
sed is very powerful and versatile, but that is what makes it slow, especially for large files with many lines. tail does just one simple thing, but it does that well and fast, even for bigger files with many lines.
For small and medium-sized files, sed and tail perform similarly fast (or slow, depending on your expectations). However, for larger input files (multiple MBs), the performance difference grows significantly (an order of magnitude for files in the range of hundreds of MBs), with tail clearly outperforming sed.
Experiment
General Preparations:
Our commands to analyze are:
sed '1d' testfile > /dev/null
tail -n +2 testfile > /dev/null

Note that I'm redirecting the output to /dev/null each time to eliminate terminal output or file writes as a performance bottleneck.
Let's set up a RAM disk to eliminate disk I/O as potential bottleneck. I personally have a tmpfs mounted at /tmp so I simply placed my testfile there for this experiment.
Then I create, once, a random test file containing a specified number of lines $numoflines, with random line lengths and random data, using this command (note that it's definitely not optimal; it becomes really slow for more than about 2M lines, but who cares, it's not the thing we're analyzing):
cat /dev/urandom | base64 -w0 | tr 'n' '\n' | head -n "$numoflines" > testfile

Oh, btw, my test laptop is running Ubuntu 16.04, 64-bit, on an Intel i5-6200U CPU. Just for comparison.
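A side note (my addition, not from the answer): the tr 'n' '\n' step is what produces the random line lengths. base64 -w0 emits one unbroken stream, and replacing every occurrence of the letter 'n' with a newline breaks it at effectively random positions. The same pipeline at toy scale:

```shell
# Same generator, tiny scale: take a little random data, base64 it
# into one long line, then split at every letter 'n' and keep 5 lines.
head -c 65536 /dev/urandom | base64 -w0 | tr 'n' '\n' | head -n 5 > /tmp/minitest
wc -l < /tmp/minitest
```

With 64 KiB of input, the base64 stream contains far more than five 'n' characters, so five lines always come out.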
Timing big files:
Setting up a huge testfile:
Running the command above with numoflines=10000000 produced a random file containing 10M lines, occupying a bit over 600 MB - it's quite huge, but let's start with it, because we can:
$ wc -l testfile
10000000 testfile
$ du -h testfile
611M testfile
$ head -n 3 testfile
qOWrzWppWJxx0e59o2uuvkrfjQbzos8Z0RWcCQPMGFPueRKqoy1mpgjHcSgtsRXLrZ8S4CU8w6O6pxkKa3JbJD7QNyiHb4o95TSKkdTBYs8uUOCRKPu6BbvG
NklpTCRzUgZK
O/lcQwmJXl1CGr5vQAbpM7TRNkx6XusYrO

Perform the timed run with our huge testfile:
Now let's do just a single timed run with both commands first, to estimate what magnitudes we're working with.
$ time sed '1d' testfile > /dev/null
real 0m2.104s
user 0m1.944s
sys 0m0.156s
$ time tail -n +2 testfile > /dev/null
real 0m0.181s
user 0m0.044s
sys 0m0.132s

We already see a really clear result for big files: tail is an order of magnitude faster than sed. But just for fun, and to be sure there are no random side effects making a big difference, let's do it 100 times:
$ time for i in {1..100}; do sed '1d' testfile > /dev/null; done
real 3m36.756s
user 3m19.756s
sys 0m15.792s
$ time for i in {1..100}; do tail -n +2 testfile > /dev/null; done
real 0m14.573s
user 0m1.876s
sys 0m12.420s

The conclusion stays the same: sed is inefficient for removing the first line of a big file; tail should be used there.
And yes, I know Bash's loop constructs are slow, but we're only doing relatively few iterations here and the time a plain loop takes is not significant compared to the sed/tail runtimes anyway.
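To back that up, here's a quick sanity check I added (not part of the original benchmark): run the same loop with the shell's no-op builtin : as the body and wrap it in time; the loop machinery's cost is negligible next to the sed/tail runtimes.

```shell
# Bare loop with ':' (the shell no-op builtin) as the body; prefix
# the loop with `time` to measure the loop overhead by itself.
count=0
for i in {1..100}; do
    :
    count=$((count + 1))
done
echo "$count"   # prints 100, confirming all iterations ran
```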
Timing small files:
Setting up a small testfile:
Now for completeness, let's look at the more common case that you have a small input file in the kB range. Let's create a random input file with numoflines=100, looking like this:
$ wc -l testfile
100 testfile
$ du -h testfile
8,0K testfile
$ head -n 3 testfile
tYMWxhi7GqV0DjWd
pemd0y3NgfBK4G4ho/
aItY/8crld2tZvsU5ly

Perform the timed run with our small testfile:
As, from experience, we can expect the timings for such small files to be in the range of a few milliseconds, let's just do 1000 iterations right away:
$ time for i in {1..1000}; do sed '1d' testfile > /dev/null; done
real 0m7.811s
user 0m0.412s
sys 0m7.020s
$ time for i in {1..1000}; do tail -n +2 testfile > /dev/null; done
real 0m7.485s
user 0m0.292s
sys 0m6.020s

As you can see, the timings are quite similar; there's not much to interpret or wonder about. For small files, both tools are equally well suited.
Here's another alternative, using just bash builtins and cat:
{ read ; cat > headerless.txt; } < $file

$file is redirected into the { } command grouping. The read simply reads and discards the first line. The rest of the stream is then read by cat, which writes it to the destination file.
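A toy run (my own illustration; file names are arbitrary) confirms the grouping behaves as described:

```shell
# Build a three-line input, then strip its header with the grouping.
# read -r avoids backslash mangling on the discarded line.
printf 'header\nbody1\nbody2\n' > /tmp/infile.txt
{ read -r ; cat > /tmp/headerless.txt; } < /tmp/infile.txt
cat /tmp/headerless.txt   # body1, body2
```

Both read and cat share the same redirected file descriptor, which is why cat picks up exactly where read stopped.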
On my Ubuntu 16.04 system the performance of this and of the tail solution is very similar. I created a largish test file with seq:
$ seq 100000000 > 100M.txt
$ ls -l 100M.txt
-rw-rw-r-- 1 ubuntu ubuntu 888888898 Dec 20 17:04 100M.txt
tail solution:
$ time tail -n +2 100M.txt > headerless.txt
real 0m1.469s
user 0m0.052s
sys 0m0.784s
cat/brace solution:
$ time { read ; cat > headerless.txt; } < 100M.txt
real 0m1.877s
user 0m0.000s
sys 0m0.736s
I only have an Ubuntu VM handy right now, though, and saw significant variation in the timings of both, though they're all in the same ballpark.
Trying it on my system, and prefixing each command with time, I got the following results:
sed:
real 0m0.129s
user 0m0.012s
sys 0m0.000s

and tail:
real 0m0.003s
user 0m0.000s
sys 0m0.000s

which suggests that, on my system at least (AMD FX 8250 running Ubuntu 16.04), tail is significantly faster. The test file had 10,000 lines with a size of 540k, and was read from an HDD.
There is no objective way to say which is better, because sed and tail aren't the only things that run on a system during program execution. Many factors, such as disk I/O, network I/O, and CPU interrupts for higher-priority processes, influence how fast your program will run.
Both of them are written in C, so this is not a language issue, but more of an environmental one. For example, I have an SSD, and on my system this will take microseconds, but for the same file on a hard drive it will take longer, because HDDs are significantly slower. So hardware plays a role in this, too.
There are a few things that you may want to keep in mind when considering which command to choose:
- What is your purpose? sed is a stream editor for transforming text. tail is for outputting specific lines of text. If you want to deal with lines and only print them out, use tail. If you want to edit the text, use sed.
- tail has a far simpler syntax than sed, so use what you can read yourself and what others can read.
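A minimal illustration of that split (my example, not from the answer): tail selects lines, sed transforms them.

```shell
printf 'header\nalpha\nbeta\n' > /tmp/demo.txt

# tail: print lines (everything from line 2 onward), unchanged.
tail -n +2 /tmp/demo.txt

# sed: edit the stream (rewrite the word alpha), keeping all lines.
sed 's/alpha/ALPHA/' /tmp/demo.txt
```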
Another important factor is the amount of data you're processing. Small files won't give you any performance difference. The picture gets interesting when you're dealing with big files. With a 2 GB BIGFILE.txt, we can see that sed has far more system calls than tail, and runs considerably slower.
bash-4.3$ du -sh BIGFILE.txt
2.0G BIGFILE.txt
bash-4.3$ strace -c sed '1d' ./BIGFILE.txt > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
 59.38    0.079781           0    517051           read
 40.62    0.054570           0    517042           write
  0.00    0.000000           0        10         1 open
  0.00    0.000000           0        11           close
  0.00    0.000000           0        10           fstat
  0.00    0.000000           0        19           mmap
  0.00    0.000000           0        12           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         2           rt_sigaction
  0.00    0.000000           0         1           rt_sigprocmask
  0.00    0.000000           0         1         1 ioctl
  0.00    0.000000           0         7         7 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           getrlimit
  0.00    0.000000           0         2         2 statfs
  0.00    0.000000           0         1           arch_prctl
  0.00    0.000000           0         1           set_tid_address
  0.00    0.000000           0         1           set_robust_list
------ ----------- ----------- --------- --------- ----------------
100.00 0.134351 1034177 11 total
bash-4.3$ strace -c tail -n +2 ./BIGFILE.txt > /dev/null
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
 62.30    0.148821           0    517042           write
 37.70    0.090044           0    258525           read
  0.00    0.000000           0         9         3 open
  0.00    0.000000           0         8           close
  0.00    0.000000           0         7           fstat
  0.00    0.000000           0        10           mmap
  0.00    0.000000           0         4           mprotect
  0.00    0.000000           0         1           munmap
  0.00    0.000000           0         3           brk
  0.00    0.000000           0         1         1 ioctl
  0.00    0.000000           0         3         3 access
  0.00    0.000000           0         1           execve
  0.00    0.000000           0         1           arch_prctl
------ ----------- ----------- --------- --------- ----------------
100.00 0.238865 775615 7 total

The top answer didn't take the disk into account, since it redirects output to > /dev/null.
If you have a large file and don't want to create a temporary duplicate on your disk, try vim -c:
$ cat /dev/urandom | base64 -w0 | tr 'n' '\n'| head -n 10000000 > testfile
$ time sed -i '1d' testfile
real 0m59.053s
user 0m9.625s
sys 0m48.952s
$ cat /dev/urandom | base64 -w0 | tr 'n' '\n'| head -n 10000000 > testfile
$ time vim -e -s testfile -c ':1d' -c ':wq'
real 0m8.259s
user 0m3.640s
sys 0m3.093s

Edit: if the file is larger than available memory, vim -c doesn't work; it looks like it's not smart enough to do an incremental load of the file.
Other answers show well which option is better for creating a new file with the first line missing. If you want to edit a file in place rather than create a new file, though, I bet ed would be faster, because it shouldn't create a new file at all. But you'll have to look up how to remove a line with ed, because I've only used it once.
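For reference, a sketch of the ed approach (my addition; the answer above doesn't give one): feed ed its commands on stdin, where 1d deletes line one, w writes the buffer back to the same file, and q quits. The -s flag suppresses ed's diagnostics. The file name is just for illustration.

```shell
# Create a sample file, then delete its first line in place with ed.
printf 'header\nline1\nline2\n' > /tmp/edtest.txt
printf '1d\nw\nq\n' | ed -s /tmp/edtest.txt
cat /tmp/edtest.txt   # line1, line2
```

Note that ed still loads the whole file into its buffer, so very large files hit the same memory limitation mentioned for vim above.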