It's also nice to have a copy of threads on your own hard drive, for personal reference.
Accordingly, a friend and I have come up with this automated backup script, which will copy the contents of every thread found in a Search Results page of your choosing to your hard drive. Pages in each thread will be organized by thread title - any further organization will be your own problem.
SYSTEM REQUIREMENTS
1. Reliable Internet connection - I'm not sure how fault-tolerant this script is, and if you lose connection in the middle of the operation you may have to delete the partial backup and start over.
2. 1-2 GB of free hard drive space. My own backup consisted of 235 threads and 1,500+ pages, totalling 255 MB - but your backup may be larger, and a margin of free space is always good.
3. Linux or Mac OS X. There's a version of wget for Windows, but I have no idea how to call it from Python - help is welcome.
4. An up-to-date Python installation. Most Linux distributions include this. Mac users may have to install it themselves.
5. The Linux utility "wget". This may be included with your Linux distribution, or you may have to install it yourself. Mac users can get it through the Fink package manager (refer to the instructions on the Fink website for help with its installation and operation).
6. Firefox and the Firefox plugin "Export Domain Cookies" (link), or another method of exporting the Jolt session cookies.
7. Time. This script backs up one page's worth of posts every 2-3 seconds, which means a full backup can take hours. Be patient - it'll finish eventually.
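If you want to sanity-check the wget and cookie requirements before kicking off a multi-hour run, a quick Python sketch like this can help. (The function name and messages here are mine, not part of the script, and `shutil.which` needs a reasonably modern Python - on older installs, running `which wget` in a terminal tells you the same thing.)

```python
import os
import shutil

def check_prerequisites(cookie_file="cookies.txt"):
    """Return a list of problems that would stop the backup; empty means ready."""
    problems = []
    # shutil.which looks for an executable on your PATH, like `which` does.
    if shutil.which("wget") is None:
        problems.append("wget is not installed or not on your PATH")
    # The script expects the exported cookies next to itself.
    if not os.path.isfile(cookie_file):
        problems.append(cookie_file + " is missing from the current directory")
    return problems

for problem in check_prerequisites():
    print(problem)
```

If it prints nothing, you're good to go.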
INSTRUCTIONS FOR RESPONSIBLE USE
Instructions for the actual operation of this thing are in the script itself, at the top. That said, there are a couple of things I would suggest about how to use this script:
1. Don't reduce the time.sleep commands to 0. This will introduce more room for instability and also may piss Jolt off if it causes the script to hammer their server.
2. Back up ONLY what you need. I suggest simply searching for your own username and backing that up (for example, I backed up the results of a search for username "Khrrck"). If you MUST back up the results of a keyword search, for the love of God don't back up common keywords. You'll make the script, the Jolt servers and me cry.
3. If you have the time and the ability to help, consider helping other users who are unable to run the script back up their threads. .ZIP or .RAR files containing backed-up threads up to 1 GB in total can be traded via sites like Megaupload. Larger archives can be sent as segmented .RAR files.
- Code:
#!/usr/bin/env python
#################################################################
#NATIONSTATES FORUM BACKUP SCRIPT (1.0, CREDIT KHRRCK AND AURIX)#
#################################################################
#INSTRUCTIONS#
##############
#INSTALLING#
############
#
# You will need:
# 1. A Linux or OS X operating system (Windows MAY work if you can install wget and add it to your path, but it has not been tested)
# 2. wget, installed and in your path
# 3. Python
# 4. Firefox and the Firefox plugin "Export Domain Cookies" (or equivalent)
#
# When you have all that:
# 5. Copy this ENTIRE script into a PLAIN TEXT (not RTF!) file in an empty folder. Rename it to "forum.py". Linux users may have to use "chmod +x forum.py" to make it executable.
#
#
###########
#OPERATING#
###########
#
# 1. Perform a search for the topics you want to back up. Note the SearchID in the result's URL. The search must be in the right mode to return topics instead of posts, otherwise the script will not work.
# 2. Replace Xes in the "search_id = "XXXX"" line below with your SearchID.
# 3. For the script to run, it must access the password-protected search page. To do this, you must export your cookies.
# Use the Firefox plugin "Export Domain Cookies" to do this; after installing, go to the search result page, open the Tools menu and choose Export Domain Cookies. Do not log out while the script is running afterwards or the cookies will become invalid!
# 4. Place the cookies.txt file this generates in the same directory as the script file.
# 5. Run the script (from the command line, in the same directory as the script, the command is "./forum.py"). Be patient - this will take a while. Hours, depending on the number and size of threads to back up. No folders will be generated until all pages of all threads have been downloaded, for reasons of simplicity and completeness.
# 6. Enjoy!
#
import re
import os
import time

#Replace Xes in the line directly below with the SearchID you found!
search_id = "XXXX"

# Walk the search results (assumed to span at most 16 pages of 15 threads
# each) and collect every thread ID they link to.
thread_ids = []
for page in range(1, 17):
    os.system("wget --user-agent=\"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) Gecko/20090715 Firefox/3.5.1\" --load-cookies cookies.txt \"http://forums.joltonline.com/search.php?searchid=" + search_id + "&pp=15&page=" + str(page) + "\"")
    time.sleep(1)
    search_file = open("search.php?searchid=" + search_id + "&pp=15&page=" + str(page), "r")
    dump = search_file.read()
    search_file.close()
    # Each thread link on the results page looks like "showthread.php?t=12345".
    regex = "showthread.php\\?t=([0-9]+)"
    for thread_id in re.findall(regex, dump):
        thread_ids.append(thread_id)

# Drop duplicate IDs, then clean up the downloaded search pages.
thread_ids = list(set(thread_ids))
os.system("rm search.php*")

# Fetch page 1 of every thread and read its page count out of the
# "Page 1 of N" marker; single-page threads have no marker.
thread_ids_with_pages = []
for i in range(len(thread_ids)):
    os.system("wget --user-agent=\"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) Gecko/20090715 Firefox/3.5.1\" \"http://forums.joltonline.com/showthread.php?t=" + thread_ids[i] + "\"")
    time.sleep(1)
    current_file = open("showthread.php?t=" + thread_ids[i], "r")
    dump = current_file.read()
    current_file.close()
    result = re.search("Page 1 of ([0-9]+)", dump)
    if result:
        number_of_pages = result.group(1)
    else:
        number_of_pages = "1"
    thread_ids_with_pages.append([thread_ids[i], number_of_pages])

# Fetch the remaining pages of every multi-page thread.
for i in range(len(thread_ids_with_pages)):
    if int(thread_ids_with_pages[i][1]) >= 2:
        for page_number in range(2, int(thread_ids_with_pages[i][1]) + 1):
            os.system("wget --html-extension --user-agent=\"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) Gecko/20090715 Firefox/3.5.1\" \"http://forums.joltonline.com/showthread.php?t=" + thread_ids_with_pages[i][0] + "&page=" + str(page_number) + "\"")
            time.sleep(3)

# Finally, sort the downloaded pages into one folder per thread, named
# after the thread title found in each page's <title> tag.
for filename in os.listdir("."):
    if os.path.isdir(filename):
        continue
    if not filename.startswith("showthread"):
        continue  # leave forum.py, cookies.txt and anything else alone
    current_file = open(filename, "r")
    dump = current_file.read()
    current_file.close()
    regex = "(<title>)(.+?)( - Page [0-9]+)?( - Jolt Forums)(</title>)"
    result = re.search(regex, dump)
    if result:
        title = result.group(2)
        title = title.strip()
        title = title.replace("/", "_")
        os.system("mkdir -p \"" + title + "\"")
        os.system("mv \"" + filename + "\" \"" + title + "/" + filename + "\"")
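If you're curious what that final organizing pass actually does to a page, the title-matching step can be tried on its own. This sketch reuses the script's regex on a made-up <title> line (the function name and sample titles are mine, purely for illustration):

```python
import re

# The same pattern the script uses to pull a thread title out of a saved page.
TITLE_RE = "(<title>)(.+?)( - Page [0-9]+)?( - Jolt Forums)(</title>)"

def folder_name_for(page_html):
    """Return the folder name the script would file this page under, or None."""
    result = re.search(TITLE_RE, page_html)
    if not result:
        return None
    # Strip whitespace and replace "/" so the title is a safe directory name.
    return result.group(2).strip().replace("/", "_")

print(folder_name_for("<title>My RP Archive - Page 12 - Jolt Forums</title>"))
# -> My RP Archive
```

Note that the optional " - Page N" group keeps all pages of one thread landing in the same folder, and the "/" replacement stops a title from being mistaken for a subdirectory path.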