Automated Forum Backup Script (Get copies of Jolt threads)

by **Khrrck** » Sat Aug 01, 2009 6:58 pm

Call me paranoid, but I'm pretty sure that Jolt will eventually delete their NationStates forum sections. If that happens, everything in them (including old Pre-Jolt threads) will be lost.

It's also nice to have a copy of threads on your own hard drive, for personal reference.

Accordingly, a friend and I have come up with this automated backup script, which copies the contents of every thread found on a Search Results page of your choosing to your hard drive. Pages in each thread are organized by thread title - any further organization is up to you.

SYSTEM REQUIREMENTS
1. Reliable Internet connection - I'm not sure how fault-tolerant this script is, and if you lose connection in the middle of the operation you may have to delete the partial backup and start over.
2. 1-2 GB of free hard drive space. My own backup consisted of 235 threads and 1,500+ pages, totalling 255 megabytes - but your backups may be larger and a margin of free space is always good.
3. Linux or Mac OS X. There's a version of wget for Windows, but I have no idea how to call it from Python - help is welcome.
4. An up-to-date Python installation. Most Linux distributions include this. Mac users may have to install it themselves.
5. The Linux utility "wget". This may be included with your Linux distribution, or you may have to install it yourself. Mac users can get it through the Fink package manager (refer to the instructions on the Fink website for help with its installation and operation).
6. Firefox and the Firefox plugin "Export Domain Cookies", or another method of exporting the Jolt session cookies.
7. Time. This script backs up one page's worth of posts every 2-3 seconds. This means a full backup can take hours. Be patient, it'll finish eventually.
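
On the question of calling wget from Python on Windows: one possible approach, sketched here and not tested against Jolt, is to skip the shell entirely and hand wget an argument list through the subprocess module. This assumes a newer Python (3.5 or later) than the script itself was written for, and the helper names build_wget_command and fetch are made up for illustration. The wget options are the same ones the script uses.

```python
import shutil
import subprocess

# Same browser identification string the backup script sends.
USER_AGENT = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) "
              "Gecko/20090715 Firefox/3.5.1")


def build_wget_command(url, cookies_file=None, wget_path="wget"):
    """Return the wget invocation as an argument list (no shell quoting needed)."""
    command = [wget_path, "--user-agent=" + USER_AGENT]
    if cookies_file is not None:
        command += ["--load-cookies", cookies_file]
    command.append(url)
    return command


def fetch(url, cookies_file=None):
    """Run wget on one URL; return True if wget exited with status 0."""
    wget_path = shutil.which("wget")  # None when wget is not on the PATH
    if wget_path is None:
        raise RuntimeError("wget was not found on the PATH")
    result = subprocess.run(build_wget_command(url, cookies_file, wget_path))
    return result.returncode == 0
```

Because the URL is passed as its own list element, characters like "&" and "?" need no escaping, which is the part that tends to go wrong when building one long command string.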

INSTRUCTIONS FOR RESPONSIBLE USE
Instructions for the actual operation of this thing are in the script itself, at the top. That said, there are a couple things I would suggest about how to use this script:

1. Don't reduce the time.sleep commands to 0. This will introduce more room for instability and also may piss Jolt off if it causes the script to hammer their server.

2. Back up ONLY what you need. I suggest simply searching for your own username and backing that up (for example, I backed up the results of a search for username "Khrrck"). If you MUST back up the results of a keyword search, for the love of God don't back up common keywords. You'll make the script, the Jolt servers and me cry.

3. If you have the time and the ability to help, consider helping other users who are unable to run the script back up their threads. .ZIP or .RAR files containing backed-up threads up to 1 GB in total can be traded via sites like Megaupload. Larger archives can be sent in segmented .RAR files.
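
To illustrate the first point, here is a minimal sketch of a delay helper that guarantees a minimum gap between requests. It is not what the script does (the script simply calls time.sleep with a fixed value), and the class name Throttle and its parameters are hypothetical. It is written for Python 3.

```python
import time


class Throttle:
    """Make sure at least min_interval seconds pass between two requests."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock    # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, None before the first

    def wait(self):
        """Sleep just long enough to respect min_interval, then record the time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Calling wait() before each download with min_interval set to 3 would keep the pace at or below one page every three seconds, even if the individual downloads take varying amounts of time.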

Code:

#!/usr/bin/env python
#################################################################
#NATIONSTATES FORUM BACKUP SCRIPT (1.0, CREDIT KHRRCK AND AURIX)#
#################################################################
#INSTRUCTIONS#
##############
#INSTALLING#
############
#
# You will need:
# 1. A Linux or OS X operating system (Windows MAY work if you can install wget
#    and add it to your path, but it has not been tested)
# 2. wget, installed and in your path
# 3. Python
# 4. Firefox and the Firefox plugin "Export Domain Cookies" (or equivalent)
#
# When you have all that:
# 5. Copy this ENTIRE script into a PLAIN TEXT (Not RTF!) file in an empty folder.
#    Rename it to "forum.py". Linux users may have to use "chmod +x forum.py" to
#    make it executable.
#
###########
#OPERATING#
###########
#
# 1. Perform a search for the topics you want to back up. Note the SearchID in the
#    URL result. Search must be in the right mode to return topics instead of posts,
#    otherwise the script will not work.
# 2. Replace Xes in the 'search_id = "XXXX"' line below with your SearchID.
# 3. For the script to run, it must access the password-protected search page.
#    To do this, you must export your cookies. Use the Firefox plugin "Export Domain
#    Cookies" to do this; after installing, go to the search result page, open the
#    Tools menu and choose Export Domain Cookies. Do not log out while the script is
#    running afterwards or the cookies will become invalid!
# 4. Place the cookies.txt file this generates in the same directory as the script file.
# 5. Run the script. (From the command line in the same directory as the script, the
#    command is "./forum.py".) Be patient - this will take a while. Hours, depending
#    on the number and size of threads to back up. No folders will be generated until
#    all pages of all threads have been downloaded, for reasons of simplicity and
#    completeness.
# 6. Enjoy!
#
import re
import os
import time

#Replace Xes in the line directly below with the SearchID you found!
search_id = "XXXX"

#Every wget call starts the same way, identifying itself as Firefox
wget = "wget --user-agent=\"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) Gecko/20090715 Firefox/3.5.1\" "

#Step 1: download search result pages 1 to 16 and collect the thread IDs they link to
thread_ids = []
for page in range(1,17):
    os.system(wget + "--load-cookies cookies.txt \"http://forums.joltonline.com/search.php?searchid=" + search_id + "&pp=15&page=" + str(page) + "\"")
    time.sleep(1)
    search_file = open("search.php?searchid=" + search_id + "&pp=15&page=" + str(page),"r")
    dump = search_file.read()
    search_file.close()
    regex = r"(showthread.php\?t=([0-9]+))"
    result = re.findall(regex, dump)
    for i in range(0,len(result)):
        thread_ids.append(result[i][1])
#Remove duplicate IDs (the built-in set replaces the deprecated sets module)
thread_ids = list(set(thread_ids))
os.system("rm search.php*")

#Step 2: download page 1 of every thread and read how many pages it has
thread_ids_with_pages = []
for i in range(0,len(thread_ids)):
    os.system(wget + "\"http://forums.joltonline.com/showthread.php?t=" + thread_ids[i] + "\"")
    time.sleep(1)
    current_file = open("showthread.php?t=" + thread_ids[i],"r")
    dump = current_file.read()
    current_file.close()
    regex = "(Page 1 of ([0-9]+))"
    result = re.search(regex, dump)
    if result:
        number_of_pages = int(result.group(2))
    else:
        number_of_pages = 1
    thread_ids_with_pages.append([thread_ids[i],number_of_pages])

#Step 3: download the remaining pages of every multi-page thread
for i in range(0,len(thread_ids_with_pages)):
    if thread_ids_with_pages[i][1] >= 2:
        for page_number in range(2,thread_ids_with_pages[i][1] + 1):
            os.system(wget + "--html-extension \"http://forums.joltonline.com/showthread.php?t=" + thread_ids_with_pages[i][0] + "&page=" + str(page_number) + "\"")
            time.sleep(3)

#Step 4: move every downloaded page into a folder named after its thread title
for file in os.listdir("."):
    if os.path.isdir(file):
        continue
    current_file = open(file,"r")
    dump = current_file.read()
    current_file.close()
    regex = "(<title>)(.+?)( - Page [0-9]+)?( - Jolt Forums)(</title>)"
    result = re.search(regex, dump)
    if result:
        title = result.group(2)
        title = title.strip()
        title = title.replace("/","_")
        os.system("mkdir \"" + title + "\"")
        os.system("mv \"" + file + "\" \"" + title + "/" + file + "\"")
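
The two regular expressions that drive the script (one finds thread links on a search page, the other reads the "Page 1 of N" counter) can be tried on their own without touching the Jolt servers. The sketch below is written for Python 3. The function names and the sample HTML are invented for illustration, and real Jolt pages may look different.

```python
import re

# Same patterns as the script, minus the capture groups it does not use.
THREAD_LINK = re.compile(r"showthread\.php\?t=([0-9]+)")
PAGE_COUNT = re.compile(r"Page 1 of ([0-9]+)")


def extract_thread_ids(html):
    """Return the unique thread IDs linked from a page, in order of first appearance."""
    seen = []
    for thread_id in THREAD_LINK.findall(html):
        if thread_id not in seen:
            seen.append(thread_id)
    return seen


def count_pages(html):
    """Return the page count from a 'Page 1 of N' marker, or 1 if there is none."""
    match = PAGE_COUNT.search(html)
    return int(match.group(1)) if match else 1


# Made-up snippet standing in for a downloaded page.
sample = ('<a href="showthread.php?t=12345">First</a> '
          '<a href="showthread.php?t=12345&page=2">2</a> '
          '<a href="showthread.php?t=678">Second</a> Page 1 of 4')

print(extract_thread_ids(sample))    # ['12345', '678']
print(count_pages(sample))           # 4
print(count_pages("no pager here"))  # 1
```

Unlike the set-based deduplication in the script, this version keeps the IDs in the order the search page listed them, which makes it easier to see how far a run has progressed.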

by **Charlotte Ryberg** » Sun Aug 02, 2009 2:35 am

Genius. I'd like to point out that after a bit of conversion, some threads can also be hosted on websites instead of being shared in RAR files. Some users may like to post highlights of some popular threads on NSWiki too.

by **Khrrck** » Sun Aug 02, 2009 11:33 pm

So are people using this, and is it working?

by **The Mindset** » Thu Aug 06, 2009 5:08 am

This inspired me to create this: http://www.esusalliance.co.uk/joltbackup.php

It'll take a thread id and the number of pages you'd like to back up and format it into a single, clean HTML file which you can then save as a local copy. It only loads pages at a rate of one every two seconds, so for long threads it may take a while to process.

You don't need any of the crap above to make it work. Just plug in the id, and it'll work, no matter which platform you're on.

by **Khrrck** » Thu Aug 06, 2009 6:32 pm

The Mindset wrote: This inspired me to create this: http://www.esusalliance.co.uk/joltbackup.php

It'll take a thread id and the number of pages you'd like to back up and format it into a single, clean HTML file which you can then save as a local copy. It only loads pages at a rate of one every two seconds, so for long threads it may take a while to process.

You don't need any of the crap above to make it work. Just plug in the id, and it'll work, no matter which platform you're on.

This is a great tool for individual threads and I thank you for making it.

I will note, though, that it only works for individual threads - my script will back up in bulk if you have a lot of threads you want to save.

by **Khrrck** » Thu Aug 06, 2009 8:01 pm

Whoops! Turns out it doesn't work. We're debugging right now. Hold on.

by **Khrrck** » Sat Aug 08, 2009 8:23 pm

Latest iteration. Directions are the same as for the previous version.

Code:

#!/usr/bin/env python
import re
import os
import time

search_id = "XXXX"

#Every wget call starts the same way, identifying itself as Firefox
wget = "wget --user-agent=\"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) Gecko/20090715 Firefox/3.5.1\" "

#Step 1: download the first search result page to find out how many result pages exist
thread_ids = []
os.system(wget + "--load-cookies cookies.txt \"http://forums.joltonline.com/search.php?searchid=" + search_id + "\"")
search_file = open("search.php?searchid=" + search_id,"r")
dump = search_file.read()
search_file.close()
regex = "(Page 1 of )([0-9]+)"
result = re.search(regex, dump)
if result:
    range_ = int(result.group(2)) + 1
else:
    range_ = 2

#Step 2: download every search result page and collect the thread IDs it links to
for page in range(1,range_):
    os.system(wget + "--load-cookies cookies.txt \"http://forums.joltonline.com/search.php?searchid=" + search_id + "&pp=15&page=" + str(page) + "\"")
    time.sleep(3)
    search_file = open("search.php?searchid=" + search_id + "&pp=15&page=" + str(page),"r")
    dump = search_file.read()
    search_file.close()
    regex = r"(showthread.php\?t=([0-9]+))"
    result = re.findall(regex, dump)
    for i in range(0,len(result)):
        thread_ids.append(result[i][1])
#Remove duplicate IDs (the built-in set replaces the deprecated sets module)
thread_ids = list(set(thread_ids))
os.system("rm search.php*")

#Step 3: download page 1 of every thread and read how many pages it has
thread_ids_with_pages = []
for i in range(0,len(thread_ids)):
    os.system(wget + "\"http://forums.joltonline.com/showthread.php?t=" + thread_ids[i] + "\"")
    time.sleep(3)
    current_file = open("showthread.php?t=" + thread_ids[i],"r")
    dump = current_file.read()
    current_file.close()
    regex = "(Page 1 of ([0-9]+))"
    result = re.search(regex, dump)
    if result:
        number_of_pages = int(result.group(2))
    else:
        number_of_pages = 1
    thread_ids_with_pages.append([thread_ids[i],number_of_pages])

#Step 4: download the remaining pages of every multi-page thread
for i in range(0,len(thread_ids_with_pages)):
    if thread_ids_with_pages[i][1] >= 2:
        for page_number in range(2,thread_ids_with_pages[i][1] + 1):
            os.system(wget + "\"http://forums.joltonline.com/showthread.php?t=" + thread_ids_with_pages[i][0] + "&page=" + str(page_number) + "\"")
            time.sleep(3)

#Step 5: move every downloaded page into a folder named after its thread title
for file in os.listdir("."):
    if os.path.isdir(file):
        continue
    current_file = open(file,"r")
    dump = current_file.read()
    current_file.close()
    regex = "(<title>)(.+?)( - Page [0-9]+)?( - Jolt Forums)(</title>)"
    result = re.search(regex, dump)
    if result:
        title = result.group(2)
        title = title.strip()
        title = title.replace("/","_")
        os.system("mkdir \"" + title + "\"")
        os.system("mv \"" + file + "\" \"" + title + "/" + file + "\"")
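
The last step of the script (sorting pages into folders by thread title) shells out to mkdir and mv, which is one reason it is tied to Linux and OS X. As a rough alternative sketch, the same step could be done with Python's own os and shutil modules. This assumes Python 3, and the function names thread_title and file_into_folder are hypothetical. The title pattern is the one from the script.

```python
import os
import re
import shutil

# Title pattern from the script: thread title, optional page marker, forum name.
TITLE = re.compile(r"<title>(.+?)(?: - Page [0-9]+)? - Jolt Forums</title>")


def thread_title(html):
    """Return the thread title with '/' replaced, or None if no title matches."""
    match = TITLE.search(html)
    if match is None:
        return None
    return match.group(1).strip().replace("/", "_")


def file_into_folder(filename, title):
    """Create the folder for this thread if needed and move the page into it."""
    os.makedirs(title, exist_ok=True)
    shutil.move(filename, os.path.join(title, filename))
```

Since no shell is involved, titles containing quotation marks or dollar signs should not confuse the move the way they could inside a quoted mkdir or mv command.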
