Automated Forum Backup Script (Get copies of Jolt threads)

Khrrck
Civil Servant
 
Posts: 8
Founded: Dec 05, 2003
Left-Leaning College State

Automated Forum Backup Script (Get copies of Jolt threads)

Postby Khrrck » Sat Aug 01, 2009 6:58 pm

Call me paranoid, but I'm pretty sure that Jolt will eventually delete their NationStates forum sections. If that happens, everything in them (including old Pre-Jolt threads) will be lost.

It's also nice to have a copy of threads on your own hard drive, for personal reference.

Accordingly, a friend and I have come up with this automated backup script, which copies the contents of every thread found on a Search Results page of your choosing to your hard drive. The pages of each thread are organized into folders by thread title - any further organization will be your own problem.
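
To illustrate the title-based organization described above, here is a minimal sketch in Python. It assumes page titles follow the "Thread Title - Page N - Jolt Forums" pattern that the script itself searches for; the function name folder_for_page is made up for this example and is not part of the script.

```python
import re

# Title pattern taken from the script's own regex; the " - Page N" part is
# absent on the first page of a thread, so it is optional here.
TITLE = re.compile(r"<title>(.+?)( - Page [0-9]+)?( - Jolt Forums)</title>")


def folder_for_page(html):
    """Return the folder name for a saved page, or None if no title is found."""
    match = TITLE.search(html)
    if not match:
        return None
    # "/" cannot appear in a folder name, so it is swapped for "_".
    return match.group(1).strip().replace("/", "_")


print(folder_for_page("<title>Trade/War Thread - Page 3 - Jolt Forums</title>"))
# prints: Trade_War Thread
```

Every page of the same thread maps to the same folder name, which is why the pages end up grouped together.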

SYSTEM REQUIREMENTS
1. Reliable Internet connection - I'm not sure how fault-tolerant this script is, and if you lose connection in the middle of the operation you may have to delete the partial backup and start over.
2. 1-2 GB of free hard drive space. My own backup consisted of 235 threads and 1,500+ pages, totalling 255 megabytes - but your backups may be larger and a margin of free space is always good.
3. Linux or Mac OS X. There's a version of wget for Windows, but I have no idea how to call it from Python - help is welcome.
4. An up-to-date Python installation. Most Linux distributions include this. Mac users may have to install it themselves.
5. The Linux utility "wget". This may be included with your Linux distribution, or you may have to install it yourself. Mac users can get it through the Fink package manager (refer to the instructions on the Fink website for help with its installation and operation).
6. Firefox and the Firefox plugin "Export Domain Cookies", or another method of exporting the Jolt session cookies.
7. Time. This script backs up one page's worth of posts every 2-3 seconds. This means a full backup can take hours. Be patient, it'll finish eventually.
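
For a rough idea of how long a backup might take, the figures above can be turned into a quick estimate. This is only a sketch: the 2.5-second default is an assumed midpoint of the 2-3 second range, the function name is made up for illustration, and search pages and slow downloads are not counted.

```python
def estimate_backup_hours(total_pages, seconds_per_page=2.5):
    """Rough duration of a backup, given a page count and a per-page cost."""
    return total_pages * seconds_per_page / 3600.0


# 1,500 pages, roughly the size of the backup mentioned above
print(round(estimate_backup_hours(1500), 2))
# prints: 1.04
```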

INSTRUCTIONS FOR RESPONSIBLE USE
Instructions for the actual operation of this thing are in the script itself, at the top. That said, there are a couple things I would suggest about how to use this script:

1. Don't reduce the time.sleep commands to 0. This will introduce more room for instability and also may piss Jolt off if it causes the script to hammer their server.

2. Back up ONLY what you need. I suggest simply searching for your own username and backing that up (for example, I backed up the results of a search for username "Khrrck"). If you MUST back up the results of a keyword search, for the love of God don't back up common keywords. You'll make the script, the Jolt servers and me cry.

3. If you have the time and the ability to help, consider helping other users who are unable to run the script back up their threads. .ZIP or .RAR files containing backed-up threads up to 1 GB in total can be traded via sites like Megaupload. Larger archives can be sent in segmented .RAR files.
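
One way to guard against accidentally setting a delay to 0 is to wrap the sleep call in a helper that enforces a floor. The following is a minimal sketch, not part of the script below; the 1-second minimum and the name polite_sleep are assumptions chosen for illustration.

```python
import time

MIN_DELAY_SECONDS = 1.0  # assumed floor; the script itself sleeps 1-3 seconds


def polite_sleep(requested_seconds, sleep=time.sleep):
    """Sleep for the requested time, but never less than MIN_DELAY_SECONDS.

    The sleep function is a parameter so the helper can be tried out
    without actually waiting.
    """
    delay = max(float(requested_seconds), MIN_DELAY_SECONDS)
    sleep(delay)
    return delay
```

Calling polite_sleep(0) would still pause for one second between requests, so the server is never hit back to back.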

Code: Select all
#!/usr/bin/env python

#################################################################
#NATIONSTATES FORUM BACKUP SCRIPT (1.0, CREDIT KHRRCK AND AURIX)#
#################################################################
#INSTRUCTIONS#
##############
#INSTALLING#
############
#
#  You will need:
#  1. A Linux or OS X operating system (Windows MAY work if you can install wget and add it to your path, but it has not been tested)
#  2. wget, installed and in your path
#  3. Python
#  4. Firefox and the Firefox plugin "Export Domain Cookies" (or equivalent)
#
#  When you have all that:
#  5. Copy this ENTIRE script into a PLAIN TEXT (Not RTF!) file in an empty folder. Rename it to "forum.py". Linux users may have to use "chmod +x forum.py" to make it executable.
#
###########
#OPERATING#
###########
#
#  1. Perform a search for the topics you want to back up. Note the SearchID in the URL result. Search must be in the right mode to return topics instead of posts, otherwise the script will not work.
#  2. Replace Xes in the "search_id = "XXXX"" line below with your SearchID.
#  3. For the script to run, it must access the password-protected search page. To do this, you must export your cookies.
#  Use the Firefox plugin "Export Domain Cookies" to do this; after installing, go to the search result page, open the Tools menu and choose Export Domain Cookies. Do not log out while the script is running afterwards or the cookies will become invalid!
#  4. Place the cookies.txt file this generates in the same directory as the script file.
#  5. Run the script. (From the command line in the same directory as the script, the command is "./forum.py".) Be patient - this will take a while. Hours, depending on the number and size of threads to back up. No folders will be generated until all pages of all threads have been downloaded, for reasons of simplicity and completeness.
#  6. Enjoy!
#

import re
import os
import time

#Replace Xes in the line directly below with the SearchID you found!

search_id = "XXXX"
thread_ids = []

for page in range(1,17):
    os.system("wget --user-agent=\"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) Gecko/20090715 Firefox/3.5.1\" --load-cookies cookies.txt \"http://forums.joltonline.com/search.php?searchid=" + search_id + "&pp=15&page=" + str(page) + "\"")

    time.sleep(1)

    search_file = open("search.php?searchid=" + search_id + "&pp=15&page=" + str(page),"r")
    dump = search_file.read()
    search_file.close()

    #Raw string, with the dot escaped so it only matches a literal "."
    regex = r"(showthread\.php\?t=([0-9]+))"
    result = re.findall(regex, dump)

    for i in range(0,len(result)):
        thread_ids.append(result[i][1])

#Remove duplicates with the built-in set type (the old "sets" module is gone in Python 3)
thread_ids = list(set(thread_ids))

os.system("rm search.php*")

thread_ids_with_pages = []

for i in range(0,len(thread_ids)):
    os.system("wget --user-agent=\"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) Gecko/20090715 Firefox/3.5.1\"  \"http://forums.joltonline.com/showthread.php?t=" + thread_ids[int(i)] + "\"")

    time.sleep(1)

    current_file = open("showthread.php?t=" + thread_ids[i],"r")
    dump = current_file.read()
    current_file.close()

    regex = "(Page 1 of ([0-9]+))"
    result = re.search(regex, dump)

    if result:
        number_of_pages = result.group(2)
    else:
        number_of_pages = "1"

    thread_ids_with_pages.append([thread_ids[i],number_of_pages])

for i in range(0,len(thread_ids_with_pages)):
    #The page count is stored as a string, so convert it before comparing
    if int(thread_ids_with_pages[i][1]) >= 2:
        for page_number in range(2,int(thread_ids_with_pages[i][1]) + 1):
            os.system("wget --html-extension --user-agent=\"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) Gecko/20090715 Firefox/3.5.1\" \"http://forums.joltonline.com/showthread.php?t=" + thread_ids_with_pages[i][0] + "&page=" + str(page_number) + "\"")

            time.sleep(3)

for file in os.listdir("."):

    if os.path.isdir(file):
        continue

    current_file = open(file,"r")
    dump = current_file.read()
    current_file.close()

    regex = "(<title>)(.+?)( - Page [0-9]+)?( - Jolt Forums)(</title>)"
    result = re.search(regex, dump)

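    if result:
        title = result.group(2)
        title = title.strip()
        title = title.replace("/","_")
        #Create the folder and move the page with os calls instead of shell commands,
        #so quotes or "$" in a thread title cannot break the command
        if not os.path.isdir(title):
            os.mkdir(title)
        os.rename(file, os.path.join(title, file))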
Last edited by Khrrck on Sat Aug 01, 2009 7:00 pm, edited 1 time in total.
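
The thread-collecting step of the script above can be tried out on its own, without touching the network. This is a minimal sketch on a made-up HTML sample; the function name extract_thread_ids is hypothetical, and unlike the script it keeps the IDs in the order they first appear.

```python
import re

# Same link pattern the script searches for in each search results page.
THREAD_LINK = re.compile(r"showthread\.php\?t=([0-9]+)")


def extract_thread_ids(html):
    """Return the unique thread IDs linked from a page, in first-seen order."""
    seen = set()
    ordered = []
    for thread_id in THREAD_LINK.findall(html):
        if thread_id not in seen:
            seen.add(thread_id)
            ordered.append(thread_id)
    return ordered


# Invented sample: thread 123 is linked twice, once with a page number.
sample = ('<a href="showthread.php?t=123">A</a>'
          '<a href="showthread.php?t=456">B</a>'
          '<a href="showthread.php?t=123&page=2">A, page 2</a>')
print(extract_thread_ids(sample))
# prints: ['123', '456']
```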

Charlotte Ryberg
The Muse of the Westcountry
 
Posts: 15007
Founded: Mar 14, 2007
Civil Rights Lovefest

Re: Automated Forum Backup Script (Get copies of Jolt threads)

Postby Charlotte Ryberg » Sun Aug 02, 2009 2:35 am

Genius. I'd like to point out that after a bit of conversion, some threads can also be hosted on websites instead of being shared in RAR files. Some users may like to post highlights of some popular threads on NSWiki too.

Khrrck
Civil Servant
 
Posts: 8
Founded: Dec 05, 2003
Left-Leaning College State

Re: Automated Forum Backup Script (Get copies of Jolt threads)

Postby Khrrck » Sun Aug 02, 2009 11:33 pm

So are people using this, and is it working?

The Mindset
Envoy
 
Posts: 267
Founded: Antiquity
Ex-Nation

Re: Automated Forum Backup Script (Get copies of Jolt threads)

Postby The Mindset » Thu Aug 06, 2009 5:08 am

This inspired me to create this: http://www.esusalliance.co.uk/joltbackup.php

It'll take a thread id and the number of pages you'd like to backup and format it into a single, clean HTML file which you can then save as a local copy. It only loads pages at a rate of once per two seconds, so for long threads it may take a while to process.

You don't need any of the crap above to make it work. Just plug in the id, and it'll work, no matter which platform you're on.
Last edited by The Mindset on Thu Aug 06, 2009 6:38 am, edited 1 time in total.
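
The general idea of combining a thread's pages into one file can be sketched as follows. This is not The Mindset's implementation, which is not shown in the thread; it is a guess at the approach, using the thread URL pattern from Khrrck's script. The function names are hypothetical, and the page bodies are passed in as strings so nothing is downloaded.

```python
from html import escape

# URL pattern as used by the backup script earlier in this thread.
BASE_URL = "http://forums.joltonline.com/showthread.php"


def thread_page_urls(thread_id, page_count):
    """List the URL of every page of a thread, first page included."""
    urls = []
    for page in range(1, page_count + 1):
        if page == 1:
            urls.append("%s?t=%s" % (BASE_URL, thread_id))
        else:
            urls.append("%s?t=%s&page=%d" % (BASE_URL, thread_id, page))
    return urls


def merge_pages(title, page_bodies):
    """Join already-downloaded page bodies into a single HTML document."""
    parts = ["<html><head><title>%s</title></head><body>" % escape(title)]
    for number, body in enumerate(page_bodies, 1):
        parts.append("<h2>Page %d</h2>" % number)
        parts.append(body)
    parts.append("</body></html>")
    return "\n".join(parts)
```

A real tool would fetch each URL with a pause in between, as the post describes, and would likely strip the forum's navigation from each page before merging.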

Khrrck
Civil Servant
 
Posts: 8
Founded: Dec 05, 2003
Left-Leaning College State

Re: Automated Forum Backup Script (Get copies of Jolt threads)

Postby Khrrck » Thu Aug 06, 2009 6:32 pm

The Mindset wrote:This inspired me to create this: http://www.esusalliance.co.uk/joltbackup.php

It'll take a thread id and the number of pages you'd like to backup and format it into a single, clean HTML file which you can then save as a local copy. It only loads pages at a rate of once per two seconds, so for long threads it may take a while to process.

You don't need any of the crap above to make it work. Just plug in the id, and it'll work, no matter which platform you're on.


This is a great tool for individual threads and I thank you for making it. ;) I will note, though, that it only works for individual threads - my script will back up in bulk if you have a lot of threads you want to save.

Khrrck
Civil Servant
 
Posts: 8
Founded: Dec 05, 2003
Left-Leaning College State

Re: Automated Forum Backup Script (Get copies of Jolt threads)

Postby Khrrck » Thu Aug 06, 2009 8:01 pm

Whoops! Turns out it doesn't work. We're debugging right now. Hold on. ;)

Khrrck
Civil Servant
 
Posts: 8
Founded: Dec 05, 2003
Left-Leaning College State

Re: Automated Forum Backup Script (Get copies of Jolt threads)

Postby Khrrck » Sat Aug 08, 2009 8:23 pm

Latest iteration. Directions are the same as for the previous version.

Code: Select all
#!/usr/bin/env python

import re
import os
import time

search_id = "XXXX"
thread_ids = []

#Fetch the first results page once, to find out how many pages there are
os.system("wget --user-agent=\"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) Gecko/20090715 Firefox/3.5.1\" --load-cookies cookies.txt \"http://forums.joltonline.com/search.php?searchid=" + search_id + "\"")

search_file = open("search.php?searchid=" + search_id,"r")
dump = search_file.read()
search_file.close()

regex = "(Page 1 of )([0-9]+)"
result = re.search(regex, dump)

if result:
    range_=int(result.group(2))+1
else:
    range_=2

for page in range(1,range_):
    os.system("wget --user-agent=\"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) Gecko/20090715 Firefox/3.5.1\" --load-cookies cookies.txt \"http://forums.joltonline.com/search.php?searchid=" + search_id + "&pp=15&page=" + str(page) + "\"")

    time.sleep(3)

    search_file = open("search.php?searchid=" + search_id + "&pp=15&page=" + str(page),"r")
    dump = search_file.read()
    search_file.close()

    #Raw string, with the dot escaped so it only matches a literal "."
    regex = r"(showthread\.php\?t=([0-9]+))"
    result = re.findall(regex, dump)

    for i in range(0,len(result)):
        thread_ids.append(result[i][1])

#Remove duplicates with the built-in set type (the old "sets" module is gone in Python 3)
thread_ids = list(set(thread_ids))

os.system("rm search.php*")

thread_ids_with_pages = []

for i in range(0,len(thread_ids)):
    os.system("wget --user-agent=\"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) Gecko/20090715 Firefox/3.5.1\"  \"http://forums.joltonline.com/showthread.php?t=" + thread_ids[int(i)] + "\"")

    time.sleep(3)

    current_file = open("showthread.php?t=" + thread_ids[i],"r")
    dump = current_file.read()
    current_file.close()

    regex = "(Page 1 of ([0-9]+))"
    result = re.search(regex, dump)

    if result:
        number_of_pages = result.group(2)
    else:
        number_of_pages = "1"

    thread_ids_with_pages.append([thread_ids[i],number_of_pages])

for i in range(0,len(thread_ids_with_pages)):
    #The page count is stored as a string, so convert it before comparing
    if int(thread_ids_with_pages[i][1]) >= 2:
        for page_number in range(2,int(thread_ids_with_pages[i][1]) + 1):
            os.system("wget --user-agent=\"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.1) Gecko/20090715 Firefox/3.5.1\" \"http://forums.joltonline.com/showthread.php?t=" + thread_ids_with_pages[i][0] + "&page=" + str(page_number) + "\"")

            time.sleep(3)

for file in os.listdir("."):

    if os.path.isdir(file):
        continue

    current_file = open(file,"r")
    dump = current_file.read()
    current_file.close()

    regex = "(<title>)(.+?)( - Page [0-9]+)?( - Jolt Forums)(</title>)"
    result = re.search(regex, dump)

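    if result:
        title = result.group(2)
        title = title.strip()
        title = title.replace("/","_")
        #Create the folder and move the page with os calls instead of shell commands,
        #so quotes or "$" in a thread title cannot break the command
        if not os.path.isdir(title):
            os.mkdir(title)
        os.rename(file, os.path.join(title, file))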
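
The main visible change in this iteration is that it reads the number of search result pages from the "Page 1 of N" text instead of assuming a fixed count. That step can be sketched in isolation as below; the function name count_pages is made up for this example, and the sample strings are invented.

```python
import re

# Pager text the script looks for on search results and thread pages.
PAGE_COUNT = re.compile(r"Page 1 of ([0-9]+)")


def count_pages(html):
    """Return the page count shown in the pager, or 1 if there is no pager."""
    match = PAGE_COUNT.search(html)
    return int(match.group(1)) if match else 1


print(count_pages("<td>Page 1 of 12</td>"))
# prints: 12
print(count_pages("<td>a short thread with no pager</td>"))
# prints: 1
```

Returning an integer straight away avoids having to convert the count again later when it is compared or passed to range().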

