M BUZZ CRAZE NEWS

Removing a variable pattern from the title of my files

By Emma Martinez

I finally received my NGS sequencing data and have spent the past few days analysing it on Ubuntu. However, I lack the basics of shell scripting, and I feel overwhelmed by this whole new language.

I have managed to follow pipelines, but I still run into beginner issues.

Specifically, I have a folder with 96 files that I want to rename. They are typically of the form:

AD18_S1_R2_cat_trimmed.fastq.gz
AD19_S26_R2_cat_trimmed.fastq.gz

Basically, I am trying to delete the sample ID, for instance _S1 and _S26. I recently discovered asterisks and used them successfully in a previous command, but I have trouble seeing how to use them here. What I think would work is to extract the expression between _S and _R and remove it, while keeping the R.

If the sample ID always had the same length, I would have used [5-7] to remove the characters from the name. But the length varies, so that won't work for some samples.

I want to understand how to do this, more than just having the answer. So, could you kindly explain to me how to make this change, and what your code means if you agree to share a solution?

3 Answers

mmv is a nice tool for this. It is not installed by default, so you can install it using:

sudo apt install mmv

Then just run the following command in the directory where you keep the files:

mmv -n '*_*_R2_cat_trimmed.fastq.gz' '#1_R2_cat_trimmed.fastq.gz'

A simplified explanation:

  • -n (no-execute) is used so that you get a preview of the changes without them getting applied. If you are satisfied with the output, rerun the command without the -n flag.

  • You want to remove anything you have between the first and the second _, so the first argument of mmv ('*_*_R2_cat_trimmed.fastq.gz') is a general expression for your files.

    The star is a wildcard that means "match any string of characters". So we match any string up to the first _, then any string between the first and second _, and we leave the rest of the file name as is.

  • The second argument ('#1_R2_cat_trimmed.fastq.gz') basically says "rename using the first match" (#1) and the rest is just the part of the string that we leave as is. Since we didn't use the second match (#2), we effectively removed it.
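
To see what the two stars capture, here is a minimal sketch that mimics the pattern using only the shell's own pattern matching, not mmv itself. The function name preview_mmv_like is made up for illustration, and ${name%%_*} stands in for #1 only because these file names have a single _ before _R2:

```shell
#!/usr/bin/env bash
# Hypothetical helper: imitate the mmv pattern with plain shell features.
# It only prints what the new name would be; nothing is renamed.
preview_mmv_like() {
  local name=$1
  case $name in
    *_*_R2_cat_trimmed.fastq.gz)
      # ${name%%_*} removes everything from the first "_" onwards,
      # leaving the text that the first * matched (#1 in mmv terms).
      printf '%s -> %s\n' "$name" "${name%%_*}_R2_cat_trimmed.fastq.gz"
      ;;
    *)
      # Names that do not fit the pattern are left alone.
      printf '%s (no match)\n' "$name"
      ;;
  esac
}

preview_mmv_like AD18_S1_R2_cat_trimmed.fastq.gz
preview_mmv_like AD19_S26_R2_cat_trimmed.fastq.gz
preview_mmv_like notes.txt
```

The first two calls print the old and new name; the third shows that a file outside the pattern is not touched.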

By default, mmv applies the changes silently. If you want to see each change as it is made, you can use the -v (verbose) flag.

For more info about mmv, you can consult its manpage, by running man mmv in your terminal.

Note: Before running any command, always test it on a subset of your files to make sure that it works as you want and you don't lose any files. It's also good to always keep a backup of the original files.
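
One low-risk way to follow that advice is to try the command on empty dummy files first. A minimal sketch, assuming mktemp and touch are available (they are on a standard Ubuntu install); the function name make_dummy_files is hypothetical:

```shell
#!/usr/bin/env bash
# Hypothetical helper: create a scratch directory containing empty files
# named like the real ones, and print the directory's path.
make_dummy_files() {
  local dir
  dir=$(mktemp -d) || return 1
  touch "$dir/AD18_S1_R2_cat_trimmed.fastq.gz" \
        "$dir/AD19_S26_R2_cat_trimmed.fastq.gz"
  printf '%s\n' "$dir"
}

scratch=$(make_dummy_files)
echo "Dummy files created in: $scratch"
ls "$scratch"
```

You can then cd into that directory, run the rename command there, and delete the directory once you are happy with the result.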

Use rename:

rename -n 's/^([^_]*)_S[^_]*(_.*)$/$1$2/' *.fastq.gz

Remove the -n if you're happy with the output.

The Perl rename tool uses Perl regular expressions, similar to sed:

  • s/pattern/replacement/modifier will search for pattern and replace it with replacement.
  • pattern is a regular expression to match your string.
  • ^([^_]*)_ => Starting at the beginning of the filename (^), match everything up to the first _ ([^_]*) and capture it with parentheses (...) so it can be reused in the replacement as $1, then match the _ itself.
  • S[^_]* => Match an S followed by any run of characters that are not an _.
  • (_.*)$ => Match an _ followed by anything (.*) up to the end of the string ($), and capture it again, this time as $2.
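
If you want to watch the capture groups work without touching any file, the same expression can be fed to sed on a plain string. This is only a sketch: strip_sample_id is a made-up function name, and sed writes the captured groups as \1 and \2 where Perl writes $1 and $2:

```shell
#!/usr/bin/env bash
# Hypothetical helper: apply the substitution to a string, not to a file.
strip_sample_id() {
  # Group 1 = text before the first "_", group 2 = "_" plus the rest
  # after the sample ID. The "_S..." part in between is not captured,
  # so it disappears from the result.
  printf '%s\n' "$1" | sed -E 's/^([^_]*)_S[^_]*(_.*)$/\1\2/'
}

strip_sample_id AD18_S1_R2_cat_trimmed.fastq.gz
strip_sample_id AD19_S26_R2_cat_trimmed.fastq.gz
```

Both calls should print the name with the _S1 or _S26 part removed.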

In general, when working with regular expressions, "less is more". I mean, avoid trying to match the entire string; instead, focus on matching just the bit you are interested in. In this case, that is the letter S followed by one or more digits. That can be expressed in the Perl regular expression syntax used by the rename tool (and closely mirrored by PCRE, Perl Compatible Regular Expressions) as S\d+.

With this in mind, here's a simpler rename command:

$ rename -n 's/_S\d+//' *fastq.gz
AD18_S1_R2_cat_trimmed.fastq.gz -> AD18_R2_cat_trimmed.fastq.gz
AD19_S26_R2_cat_trimmed.fastq.gz -> AD19_R2_cat_trimmed.fastq.gz

That command will replace the first instance of a _ followed by S and then one or more digits (\d+) with nothing, effectively removing it. The -n tells rename not to rename anything but only to print what it would do. Once you've confirmed this does what you want, remove the -n to make the command actually rename the files.
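
If rename is not installed, roughly the same thing can be sketched with a plain shell loop. This is an illustration under assumptions, not a drop-in replacement: the function name strip_ids_in_dir and its --apply argument are made up here, and [0-9] is used because sed -E does not understand \d:

```shell
#!/usr/bin/env bash
# Hypothetical helper: strip_ids_in_dir DIR [--apply]
# Without --apply it only prints the planned renames (like rename -n).
strip_ids_in_dir() {
  local dir=$1 mode=${2:-dry-run} f base new
  for f in "$dir"/*fastq.gz; do
    [ -e "$f" ] || continue        # skip if the pattern matched nothing
    base=${f##*/}                  # file name without the directory part
    # Remove the first "_S" followed by digits, as s/_S\d+// does.
    new=$(printf '%s\n' "$base" | sed -E 's/_S[0-9]+//')
    [ "$base" = "$new" ] && continue   # nothing to strip in this name
    if [ "$mode" = "--apply" ]; then
      mv -n -- "$f" "$dir/$new"    # -n: never overwrite an existing file
    else
      printf '%s -> %s\n' "$base" "$new"
    fi
  done
}

# Preview only, for the current directory:
strip_ids_in_dir .
```

Running strip_ids_in_dir . --apply would perform the renames once the preview looks right.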
