Having a Bash at Python

27 May

I’ve written elsewhere about Python, a programming language deliberately designed to be user-friendly, and I thought I should make public a couple of programs I’ve written recently for some fairly specialised tasks. One is pure Bash; the other is a combination of Bash and Python, which provides the best of both worlds: Bash commands to do the heavy lifting, and Python as a user-friendly framework for the housekeeping. I’ll discuss each one in turn below. Naturally, to use the Bash commands in the scripts you will need to have those commands installed in Linux Mint. There are also some non-standard Python libraries used in the programs; see here for details of how to install those using Pip.

I. Converting PDF files to monochrome — a Bash script

 

Various sites on the Web have PDF eBooks available for downloading; for instance, the Open Library has a large collection of out-of-copyright books. These are usually scanned from the paper originals, saved as PDF files, and then put through an OCR process to produce much smaller EPUB files. Unfortunately OCR is not error-proof, and the resulting EPUB files have a number of irritating mistakes in them. I prefer to read the PDF originals, but these can be very big files: the University Society edition of David Copperfield (1908) from the page cited above, for example, is over 94 MB. That’s not a problem for storing and reading it on a PC, but it makes it slow to upload to cloud storage and download to Android tablets and other reading devices.

One reason for the large size of some of these files is that they have been scanned in full colour. I have found that converting the pages to monochrome can reduce the file size by up to 90%, without affecting readability. There are various ways to do this already available, including printing from a PDF reader like Acrobat, and proprietary programs for Windows, but I couldn’t find anything that gave me the results I needed, so I put a program together myself. The basic schema was fairly simple, but it took a lot of tweaking at the end to get the right sort of processing done to the pages in the right order. For various reasons the conversion doesn’t always produce acceptable results, so the user has an option to abandon the modified file and retain the original. It also throws up error messages sometimes, for no apparent reason, but it always seems to go ahead and do the job anyway.

Here it is, with comments:

#!/bin/bash
# Script Name: PDFtoMonochrome.sh
# Steps through all PDF files in the directory
for f in *.pdf
do
   # Show me a page from within the PDF so I can see how dark it is
   evince --page-index=45 -f "$f"
   # Ask the user to set the white threshold
   read -p "White level (20%-50%; lower for dark pages, higher for light ones): " WhiteLevel
   # Remove the temporary directory from last time, if there is one
   rm -rf tifftemp
   # Make a temporary directory
   mkdir tifftemp
   # Split the PDF file into separate numbered pages in the temporary directory
   pdftk "$f" burst output tifftemp/page_%04d.pdf
   # Go through each PDF page file in turn
   for file in tifftemp/*.pdf
   do
      # Make it into a full-colour tiff file
      convert -density 200 "${file}" "${file}".tiff
      # Let the user know something's happening
      echo "${file} converted to tiff"
   done
   # Go through each tiff file in turn
   for file in tifftemp/*.tiff
   do
      # Sharpen it a bit
      convert -sharpen 0x3 "${file}" "${file}"
      # All shades lighter than the white threshold become white
      convert -white-threshold "$WhiteLevel" "${file}" "${file}"
      # All remaining shades below 70% brightness become black
      convert -black-threshold 70% "${file}" "${file}"
      # Change the file palette to monochrome
      convert -monochrome "${file}" "${file}"
      # Compress the monochrome page
      convert -compress lzw "${file}" "${file}"
      # Convert it back to a PDF page
      tiff2pdf -o "${file}".pdf "${file}"
      # Tell the user
      echo "${file} done"
   done
   # Make a new PDF out of the monochrome pages and call it Mono-<filename>
   pdftk tifftemp/*.tiff.pdf cat output "Mono-$f"
   # Remove the temporary directory
   rm -rf tifftemp
   # Show the user the results and let them decide whether to replace the original
   evince --page-index=45 -f "Mono-$f"
   read -p "Do you want to use monochrome PDF? [Y/N]: " KeepMono
   if [ "$KeepMono" = "Y" ] || [ "$KeepMono" = "y" ]
   then
      # Delete the original PDF and replace it with the renamed monochrome file
      rm "$f"
      mv "Mono-$f" "$f"
   else
      # Keep the original and discard the monochrome file
      rm "Mono-$f"
   fi

# Ready for next PDF
done

On my PC the conversion takes around ten seconds per page, so it’s best left to run in the background. Evince will pop up with the finished version in whatever workspace you happen to be using, so that will alert you to the job being complete.
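
To see why the white level matters so much, here is a minimal Python sketch of the two threshold steps. It is a simplification, not what ImageMagick actually runs: it assumes each pixel is a single grey level between 0.0 (black) and 1.0 (white), it ignores the sharpening and the final monochrome conversion, and the function name to_monochrome is made up for illustration.

```python
def to_monochrome(levels, white_level=0.30, black_level=0.70):
    """Map grey levels (0.0 = black, 1.0 = white) to pure black or white.

    Simplified model of a white threshold followed by a black threshold.
    """
    result = []
    for level in levels:
        # White-threshold step: anything lighter than the white level becomes white
        if level > white_level:
            level = 1.0
        # Black-threshold step: anything still darker than the black level becomes black
        if level < black_level:
            level = 0.0
        result.append(level)
    return result


# A made-up row of pixels, from dark ink to light paper
row = [0.1, 0.25, 0.35, 0.6, 0.9]
print(to_monochrome(row, white_level=0.30))
```

In this simplified model, anything lighter than the white level goes to white first, and the 70% black threshold then turns whatever is left to black. That means the white level effectively decides where the cut between ink and paper falls, which is why it is the one value the script asks for.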

II. A Python/Bash hybrid — adding a spoken file name at the beginning of MP3 files

 

A little backstory here — I go regularly to a gym, where I use earbuds attached to a tiny clip-on MP3 player. I listen to BBC radio comedy shows and panel games downloaded from the BBC site and converted to MP3 files with get-iplayer. The player plays shows in some order that I don’t really understand, and I often find myself listening to the same show several times. Unfortunately there’s often a lead-in of several minutes on the track before the show begins, and I sometimes found that I’d wasted four or five minutes waiting for the start of a show I’d already heard. So, I thought, what if I could write a program to read the filename out loud, save that as an MP3 file, and tack it on to the beginning of the file containing the show? Since files typically arrive with filenames like

The_Hitchhikers_Guide_to_the_Galaxy_-_Secondary_Phase_6._Fit_the_Twelfth_b007jm6g_default

that would tell me immediately which episode I was about to listen to.
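
As a rough illustration of the idea, here is a small Python sketch that trims a name like the one above and turns it into something a speech synthesiser could read. It assumes the BBC’s programme ID always starts with ‘_b0’ and that nothing worth keeping comes after it; the helper names tidy_name and spoken_text are invented for this example.

```python
import re


def tidy_name(filename):
    """Strip the '_b0...' programme ID and everything after it, keeping '.mp3'."""
    return re.sub(r'_b0.*', '.mp3', filename)


def spoken_text(filename):
    """Hypothetical helper: turn a tidied file name into a readable phrase."""
    stem = filename[:-len('.mp3')] if filename.endswith('.mp3') else filename
    return stem.replace('_', ' ')


name = ("The_Hitchhikers_Guide_to_the_Galaxy_-_Secondary_Phase_6."
        "_Fit_the_Twelfth_b007jm6g_default.mp3")
tidied = tidy_name(name)
print(tidied)
print(spoken_text(tidied))
```

Swapping the underscores for spaces is an extra step of my own for the sketch; it only matters if the speech synthesiser stumbles over them.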

So here it is:

#!/usr/bin/env python
# coding: utf-8
# Speakfilenames.py

# Import necessary Python libraries
from espeak import espeak
import os, sys, subprocess, string, re, glob, shutil
from pydub import AudioSegment

# Make a temporary directory
if not os.path.exists("./NameClips"):
    os.mkdir("./NameClips")

# Tidy up file names: get rid of weird suffix characters apparently used by BBC for record-keeping
for fname in glob.glob("*.mp3"):
    text = str(fname)
    text1 = re.sub('_b0.*', '.mp3', text)
    os.rename(text, text1)

# Create a WAV file in the temporary directory for each spoken MP3 filename in this directory, using the Bash 'espeak' command
for fname in glob.glob("*.mp3"):
    text = str(fname)
    subprocess.call('espeak -s 100 -w NameClips/' + text + '.wav ' + text, shell=True)

# Convert the WAV files to MP3 and remove the originals
os.chdir("./NameClips")
for fname in glob.glob("*.wav"):
    text = str(fname)
    AudioSegment.from_wav(text).export(text + '.mp3', format="mp3", bitrate="32k")
    os.remove(fname)

# Resample and adjust the rate of the MP3 files using the Bash 'lame' command; this is necessary, otherwise they muck up the speed of the radio shows
for fname in glob.glob("*.mp3"):
    text = str(fname)
    print(text)
    bashargs = "lame --resample 44 --preset cbr 32 " + text
    p = subprocess.Popen(bashargs, shell=True)
    p.wait()
    os.remove(text)

# Combine the filename files with the radio show files using the Bash 'mp3wrap' command
path = ".."
os.chdir(path)
for fname in glob.glob("*.mp3"):
    text1 = str(fname)
    text2 = "NameClips/" + text1 + ".wav.mp3.mp3"
    print(text1)
    print(text2)
    subprocess.call('mp3wrap ' + text1 + ' ' + text2 + ' ' + text1, shell=True)
    os.remove(text1)

# Remove temporary directory
shutil.rmtree('./NameClips')

# Tidy up file names, removing suffixes added by mp3wrap process
for fname in glob.glob("*.mp3"):
    text = str(fname)
    text1 = re.sub('_MP3WRAP', '', text)
    os.rename(text, text1)

So there it is. I probably could have combined some of the loops, but this way I found it easier to keep track of what was going on. Once again, this takes a while and can be run in the background while you do something else. If you want to use it on something other than BBC shows you may need to tinker with the bit rate and sampling rate settings. You could also remove the first ‘Tidy up file names’ section, though it shouldn’t do any harm to leave it there.
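
One thing to watch if you adapt it: the script glues file names straight into shell command strings, which works for underscore-only names like the BBC ones but would probably trip over a space, bracket or apostrophe. A possible alternative, sketched below, is to hand each command to subprocess as a list of arguments so that no shell is involved. The function names are invented for the example, the options are the same ones the script already uses, and I have not tested this variant against real files.

```python
import subprocess


def espeak_command(text, wav_path, speed=100):
    """Build the espeak call (speak 'text' into 'wav_path') as an argument list."""
    return ["espeak", "-s", str(speed), "-w", wav_path, text]


def mp3wrap_command(output, clip, show):
    """Build the mp3wrap call: output name first, then the files in playing order."""
    return ["mp3wrap", output, clip, show]


def run(command):
    """Run a command given as a list; no shell=True, so odd characters are passed through as-is."""
    return subprocess.call(command)


# Hypothetical file name containing spaces and brackets
show = "Fit the Twelfth (repeat).mp3"
print(espeak_command(show, "NameClips/" + show + ".wav"))
print(mp3wrap_command(show, "NameClips/" + show + ".wav.mp3.mp3", show))
```

Nothing is executed here apart from the two print calls; passing either list to run() would launch the real command.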

If you use or adapt either of these, let me know. I take no responsibility for the consequences, however.

Posted on May 27, 2014 in Python

 
