Data mining (help!)

I am examining how certain phrases crop up in vBulletin web forums to see how ideas spread among Internet audiences over time. I’m having problems and suggestions are welcome. (Update: to see how this research worked out, see this post)

Method at the moment…

Step 1.

After trying a few alternatives, I’ve started to use an application called SiteSucker, which downloads the messages from the forum onto a local disk as a series of HTML files.

Step 2.

I am doing the data mining with Anthracite, which seems to have the right features: it can strip the meta text, find phrases along with the excerpts surrounding them, and grab the date of posting, the user’s reputation, etc. However, it keeps crashing when I feed it larger numbers of files.

What I need is a tool that goes through the forum, finds a number of phrases I am interested in, and then extracts: a) the paragraphs, or a set number of characters, around each instance of the target phrases; b) the time and date of the posts they appear in; c) the title of the thread; d) the user name and reputation of the poster; and e) possibly a summary of follow-up posts. Ideally the tool would then produce a file I can look through manually, and also output a CSV for statistical analysis.
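To make the requirement concrete, here is a minimal sketch of the phrase-and-excerpt part (item a) using only the Python standard library. It is not a finished tool: the function names, the `width` parameter, and the CSV columns are all my own assumptions, and it deliberately ignores dates, thread titles, and user details, because pulling those out depends on the markup of the particular forum template.

```python
import csv
import re
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects the visible text of an HTML page, skipping script/style."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Strip tags and collapse whitespace so excerpts read cleanly."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def find_excerpts(text, phrases, width=100):
    """Yield (phrase, excerpt) for every case-insensitive match.

    The excerpt is the match plus up to `width` characters on each side.
    """
    for phrase in phrases:
        for m in re.finditer(re.escape(phrase), text, flags=re.IGNORECASE):
            start = max(0, m.start() - width)
            end = min(len(text), m.end() + width)
            yield phrase, text[start:end]


def write_csv(rows, out):
    """Write (file, phrase, excerpt) rows to an open text file."""
    writer = csv.writer(out)
    writer.writerow(["file", "phrase", "excerpt"])
    writer.writerows(rows)


def mine_folder(folder, phrases, out_path, width=100):
    """Scan every .html file under `folder` and write matches to a CSV."""
    rows = []
    for path in sorted(Path(folder).rglob("*.html")):
        text = html_to_text(path.read_text(encoding="utf-8", errors="replace"))
        for phrase, excerpt in find_excerpts(text, phrases, width):
            rows.append((path.name, phrase, excerpt))
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        write_csv(rows, out)
    return len(rows)
```

Calling something like `mine_folder("forum_dump", ["some phrase"], "hits.csv")` (both paths hypothetical) would process one file at a time, so memory use should stay modest even for a large download. Items b) to e) would need extra parsing tuned to the forum's HTML.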

Any ideas?


Post script – Daniel Lee may have a solution… Will update shortly.


Post script 2:

Since posting this, some possible alternatives have been suggested:

Skec suggests Web Sphinx, which is a great Java crawler and can certainly find instances of text, but it would take a lot of work (correct me if I’m wrong) to excerpt the data I need effectively.

Martin suggests Automap, which may come in handy at a later stage if I need to use machine analysis to show the frequencies of phrases appearing near other relevant terms.

Mat Morrison suggests that an added bonus would be to use (interesting idea), and says that I might have to go the route.

I’ll try them all, but right now I’m trying to figure out Anthracite better, and waiting on Daniel’s custom tool.

Ideas and suggested apps/approaches are still welcome!

One thought on “Data mining (help!)”

  1. Rapidminer –>

    Now trying out RapidMiner and Weka, and may have to try SPSS Modeler… any recommendations would be welcome.

    Ok, just did a review of my dataset:
    This dataset is truly massive. It contains 944,863 messages taken from nine different web forums. Even the smallest sample in the dataset, at only 891 messages (less than 0.1 percent of the total), adds up to 184,629 words, since each message is made up of about 207 words on average. Overall, the nine forums represent roughly 195,586,641 words of text, or 488,967 pages of A4 print (based on 400 words per page). Big.
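For anyone who wants to check the totals, the arithmetic can be sketched as below. It assumes the overall word count is simply the message count multiplied by the rounded average of 207 words per message; the variable names are mine.

```python
messages = 944_863          # total messages across the nine forums
words_per_message = 207     # rounded average, taken from the figures above
words_per_page = 400        # assumed words on one A4 page

# Estimated total words and the equivalent number of printed pages.
total_words = messages * words_per_message
pages = round(total_words / words_per_page)

# Share of the whole dataset held by the smallest sample (891 messages).
smallest_share_percent = 891 / messages * 100

print(total_words)                       # 195586641
print(pages)                             # 488967
print(round(smallest_share_percent, 3))  # 0.094
```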
