Download an entire website with wget

I often find web directories full of shitloads of music, TV series, movies, PDFs, pictures and porn. With IDM I can download all the files on a single page with one click (right click > Download all links with IDM).

But on a huge site with multiple pages and folders this gets clumsy, and downloading everything manually with IDM is just not practical.

Sometimes I did it anyway, with endless passion and enthusiasm, but another problem arises when browsing the top-level folders of a site.

For example, let's say http://vazor.com/drop/bb contains some episodes of Breaking Bad. By clicking up (../), I can go to the parent folder http://vazor.com/drop and see the other folders there as a plain directory listing, not as an HTML web page.

But if I click up (../) again, it takes me to http://vazor.com, which is displayed as a web page, not as a directory. There might still be other directories under vazor.com (for example, http://vazor.com/music or http://vazor.com/movies) containing other interesting stuff.

So if I want to know what other top-level directories exist under http://vazor.com, there is no way to find out. At least, I didn't know of any.

Many years ago, I asked one of my teachers how to do it. He said there is obviously a way, but he wouldn't tell me because it opens up the possibility of copyright violations. 😀

Anyway, in Linux there is a tool called “wget” with which you can download literally whatever you want, even a whole website! Thanks to this article and this article on Stack Overflow.

So, in summary, what I wanted to do was download everything from a website. To accomplish that, I can build a wget command that lets me:

  • download it all for me,
  • skip the parent folders,
  • specify how many levels deep I want to download,
  • specify which file types I want to download, or which ones I don't want,
  • specify the file size,
  • set download conditions based on file type, file size, file name, location, domain, etc.,
  • download into a single folder or keep exactly the structure it had on the server,
  • and finally make the downloaded website browsable by converting the links.
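
As a rough sketch, most of the wishes in the list above map onto a wget option. The helper below is hypothetical: the function name grab_files, the default file types, the depth of 3, the 500 MB quota and the ./downloads folder are all assumptions made for illustration. Also note that, as far as I know, wget has no per-file size filter; --quota only caps the total size of a recursive download.

```shell
# Hypothetical helper: the name, defaults and numbers are assumptions.
# Usage: grab_files URL [SUFFIXES] [DEST_DIR]
grab_files() {
    url=$1
    types=${2:-mp4,mkv,avi}   # comma-separated whitelist of file types
    dest=${3:-./downloads}    # where to save

    # --recursive         follow links and download what they point to
    # --no-parent         never climb above the starting folder
    # --level=3           go at most 3 levels deep
    # --accept            only keep files matching the whitelist
    #                     (--reject takes a blacklist instead)
    # --quota=500m        stop once about 500 MB in total has been fetched
    # --no-directories    save everything into one folder; leave it out
    #                     to recreate the folder layout of the server
    # --directory-prefix  folder to save into
    wget --recursive \
         --no-parent \
         --level=3 \
         --accept="$types" \
         --quota=500m \
         --no-directories \
         --directory-prefix="$dest" \
         "$url"
}
```

After pasting the function into a terminal, a call could look like: grab_files http://vazor.com/drop/ mp4,avi /home/duck/Downloads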

I don’t remember all the syntax and options, but you can always look them up here and here.

There is also a similar tool called curl. I don’t know much about it, but you can read about it here and here.

To wrap it up, what I actually did was type this command in a terminal:

wget -r -P /home/duck/Downloads -A mp4 http://vazor.com/drop/bb/

which gave me the directory structure of the site, including the files I wanted. But without --no-parent, a recursive download can also climb into the parent folders, so I should have used the --no-parent option with it:

wget -r -P /home/duck/Downloads -A mp4 --no-parent http://vazor.com/drop/bb/
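
One thing to be aware of: with -r, wget recreates the remote folders under the save location, so the command above puts the episodes in /home/duck/Downloads/vazor.com/drop/bb/. Here is a minimal sketch of how to get them directly into the Downloads folder instead, assuming that is what you want (the function name grab_episodes is made up for illustration):

```shell
# Hypothetical helper: same download as above, but without the
# vazor.com/drop/bb/ folders in front of every file.
# Usage: grab_episodes [DEST_DIR]
grab_episodes() {
    dest=${1:-/home/duck/Downloads}

    # -nH           do not create the "vazor.com" host folder
    # --cut-dirs=2  also skip the two remote folders "drop" and "bb"
    wget -r --no-parent -A mp4 -nH --cut-dirs=2 -P "$dest" \
         http://vazor.com/drop/bb/
}
```

The number given to --cut-dirs has to match how many folders deep the starting URL is, so for a different URL it needs to be adjusted by hand.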

The general syntax is:

wget -r -P /save/location -A jpeg,jpg,bmp,gif,png --no-parent http://www.domain.com

More information on wget:

  • -r enables recursive retrieval. See Recursive Download for more information.
  • -P sets the directory prefix where all files and directories are saved to.
  • -A sets a whitelist for retrieving only certain file types. Strings and patterns are accepted, and both can be used in a comma-separated list (as seen above). See Types of Files for more information.
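
To illustrate the difference between strings and patterns in -A, here is a small sketch. The file names it assumes (episodes starting with S01, subtitles ending in srt) and the function name grab_season_one are hypothetical.

```shell
# Hypothetical example: the file-name patterns are made up for illustration.
# Usage: grab_season_one [DEST_DIR]
grab_season_one() {
    # 'S01*.mp4' contains a wildcard, so wget treats it as a pattern;
    # 'srt' has none, so it is matched as a file name suffix.
    # The single quotes keep the shell from expanding the * itself.
    wget -r --no-parent -A 'S01*.mp4,srt' -P "${1:-.}" \
         http://vazor.com/drop/bb/
}
```

The opposite of -A is -R, which takes the same kind of list but treats it as a blacklist.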

By the way, if you really want to download an entire website, including everything, here is the wget command for you:

wget --mirror -p --convert-links -P ./home/Downloads http://domain.com
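
For reference, --mirror is shorthand for -r -N -l inf --no-remove-listing, and -p is the short form of --page-requisites (the images, stylesheets and other files a page needs to display). Below is a slightly more polite variant, as a sketch only: the one second wait, the 500k rate limit, the ./mirror folder and the function name mirror_site are assumptions, not part of the command above.

```shell
# Hypothetical helper: the name, delay, rate limit and default folder
# are assumptions.
# Usage: mirror_site URL [DEST_DIR]
mirror_site() {
    url=$1
    dest=${2:-./mirror}

    # --mirror            recursive, unlimited depth, with timestamping
    # --page-requisites   also fetch images, CSS etc. needed by each page
    # --convert-links     rewrite links so the copy is browsable offline
    # --adjust-extension  save HTML pages with an .html extension
    # --wait=1            pause one second between requests
    # --limit-rate=500k   cap the download speed at about 500 KB/s
    wget --mirror \
         --page-requisites \
         --convert-links \
         --adjust-extension \
         --wait=1 \
         --limit-rate=500k \
         --directory-prefix="$dest" \
         "$url"
}
```

Usage would be something like: mirror_site http://domain.com /home/duck/Downloads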
