8.64
10 comments
Created by mumble 1 year, 2 months ago
[ 8.64 | 0.00 ] [#754]

OK. Since I was so obsessed with k5 for so long, I have quite a bit of k5 content.
With k5 down, I recommend you wget it all down, and spread it in all directions (maybe even a torrent?).
Seriously, my hosting gets almost no traffic, so it won't mind you downloading all of this.
Also, I've moved most of it from k5-stats.org (which will go away eventually) to k5.semantic-db.org, which I'm going to keep in the long term.

Now, what do I have?
1) a big list of most k5 user names, and their user-id:
http://k5.semantic-db.org/full-k5-user-list.txt
This is a reminder of the k5 glory days, and of how many former kurons there really are.
eg, a quick look at the file has "104774 walverio" as the last entry. So there were at least 100,000 kurons! (ignoring dupes)

2) the results from my diary slurp:
http://k5.semantic-db.org/diary-slurp/

2.1) the script I used for this (heh, probably not much use now! though it shows how I converted k5 html into sw format):
https://github.com/GarryMorrison/Feynman-knowledge-engine/blob/master/create-sw-file-tools/slurp-diaries.py

2.2) in particular, the full set of k5 diaries (as of 2015-7-22), in nested-html form (1.3GB):
http://k5.semantic-db.org/diary-slurp/161942--archive-diaries--html-diaries--nested-format.zip

2.3) these diaries converted to my sw format (which is mostly useful for me, though it shouldn't be hard to load into a real database) (443 MB):
http://k5.semantic-db.org/diary-slurp/full-k5--all-k5-diaries-in-sw-format.zip
And sw format is particularly easy to grep and sed to find what you want, since each learn rule is on one line (eg, newlines in posts have been escaped).

Now, a brief explanation of the sw format (as used in these files). Learn rules take the form:
some-operator |some ket> => |another ket>
some-operator |some ket> => |ket 1> + |ket 2> + |ket 3> + ... + |ket n>

For example:
url |diary: 2006-2-4-92049-12395> => |url: http://www.kuro5hin.org/story/2006/2/4/92049/12395>
author |diary: 2006-2-4-92049-12395> => |kuron: gr3y>
title |diary: 2006-2-4-92049-12395> => |text: Behold the true face of Islam:>
child-comment |diary: 2006-2-4-92049-12395> => |comment: 2006-2-4-92049-12395-1> + |comment: 2006-2-4-92049-12395-2> + ...
author |comment: 2006-2-4-92049-12395-2> => |kuron: Lemon Juice>
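
To make that concrete, here is a rough Python sketch of pulling one learn rule apart. It ignores the fancier sw features (such as coefficients), so treat it as illustrative only, not a real sw parser:

# split one sw learn rule into (operator, ket, [kets])
# illustrative only: ignores coefficients and other sw features
def parse_learn_rule(line):
    lhs, rhs = line.rstrip("\n").split(" => ", 1)
    op, ket = lhs.split(" |", 1)
    return op.strip(), "|" + ket, [k.strip() for k in rhs.split(" + ")]

# parse_learn_rule("author |comment: 2006-2-4-92049-12395-2> => |kuron: Lemon Juice>")
# -> ('author', '|comment: 2006-2-4-92049-12395-2>', ['|kuron: Lemon Juice>'])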

Here are the stats for the full-k5-sw file:
$ ./spit-out-sw-stats.sh full-k5.sw
326e7a1af84da55e13ebcf328b166a398f3f0e59 *full-k5.sw
(2.1G, 18 op types and 868715 learn rules)
author: 96300 learn rules
body: 1204182 learn rules
body-wc: 85594 learn rules
child-comment: 82990 learn rules
date: 85594 learn rules
date-time: 1204182 learn rules
how-many-comments: 85594 learn rules
intro: 85594 learn rules
intro-wc: 85594 learn rules
is-top-level-comment: 1118588 learn rules
parent-comment: 696901 learn rules
parent-diary: 1118589 learn rules
tags: 53002 learn rules
title: 1204182 learn rules
total-wc: 85594 learn rules
url: 1204182 learn rules
wc: 1118588 learn rules

Bah! There is something wrong with that script.
This is the true number of total learn rules:
$ grep -c " => " full-k5.sw
17123667

Using the sw data, we can find things like:
comments per year:

diaries per year:

number of comments per diary:

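The per-year counts behind graphs like these can be regenerated straight from full-k5.sw. Here is a rough Python sketch (not the script I used for the graphs; it simply leans on the fact that comment ket names start with the date, and assumes every comment has exactly one parent-diary rule):

# rough sketch: tally comments per year straight from the sw file
# assumes comment kets start with the date, e.g. |comment: 2006-2-4-92049-12395-2>
from collections import Counter

prefix = "parent-diary |comment: "
years = Counter()
with open("full-k5.sw") as f:
    for line in f:
        if line.startswith(prefix):
            years[line[len(prefix):].split("-", 1)[0]] += 1

for year in sorted(years):
    print(year, years[year])

Swapping the prefix to a per-diary rule (say, "how-many-comments |diary: ") gives diaries per year the same way.
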
OK. Let's dig into the details of that last graph (number of comments per diary), and use mumble lang to produce a table showing the diaries with the most comments:


$ ./the_semantic_db_console.py
Welcome!

sa: load full-k5--how-many-comments.sw
sa: rank-table[diary,how-many-comments] select[1,30] reverse sort-by[how-many-comments] rel-kets[how-many-comments]
+------+----------------------+-------------------+
| rank | diary                | how-many-comments |
+------+----------------------+-------------------+
| 1    | 2002-5-7-202722-0453 | 835               |
| 2    | 2006-4-21-205756-081 | 356               |
| 3    | 2003-1-1-16127-40735 | 296               |
| 4    | 2004-5-18-13843-3909 | 276               |
| 5    | 2002-5-9-165212-3476 | 248               |
| 6    | 2004-9-7-41313-07421 | 244               |
| 7    | 2001-7-12-12343-2481 | 235               |
| 8    | 2002-4-26-142517-238 | 233               |
| 9    | 2006-1-5-16385-33113 | 223               |
| 10   | 2012-12-17-1649-9844 | 220               |
| 11   | 2002-4-23-104451-644 | 219               |
| 12   | 2003-3-19-8251-32475 | 211               |
| 13   | 2005-11-30-18310-255 | 210               |
| 14   | 2007-6-29-134551-001 | 207               |
| 15   | 2005-8-18-55631-0880 | 204               |
| 16   | 2005-12-30-13163-104 | 199               |
| 17   | 2002-4-25-93534-4989 | 198               |
| 18   | 2003-2-9-162755-1691 | 198               |
| 19   | 2005-12-28-14626-619 | 198               |
| 20   | 2007-5-7-103231-5181 | 198               |
| 21   | 2003-11-15-201021-60 | 193               |
| 22   | 2002-5-14-144525-460 | 189               |
| 23   | 2003-6-1-19331-10814 | 187               |
| 24   | 2006-3-7-171540-0259 | 186               |
| 25   | 2001-5-30-3634-11882 | 185               |
| 26   | 2003-4-25-21500-5428 | 185               |
| 27   | 2012-12-15-113011-53 | 184               |
| 28   | 2003-5-7-122415-5631 | 183               |
| 29   | 2005-1-14-172115-002 | 183               |
| 30   | 2003-1-31-880-29290  | 181               |
+------+----------------------+-------------------+
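
For anyone without the semantic-db console, here is a rough Python equivalent of that rank table. The exact format of the how-many-comments value ket is a guess on my part, so the regex may need a tweak:

# rough sketch: top-30 diaries by comment count, without the console
# assumes rules of the form: how-many-comments |diary: ...> => |...835>
import re, heapq

rule = re.compile(r'^how-many-comments \|diary: ([^>]+)> => \|[^0-9>]*([0-9]+)>')
counts = []
with open("full-k5--how-many-comments.sw") as f:
    for line in f:
        m = rule.match(line)
        if m:
            counts.append((int(m.group(2)), m.group(1)))

for rank, (count, diary) in enumerate(heapq.nlargest(30, counts), 1):
    print(rank, diary, count)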

Next, the results from my monthly stats are here:
http://k5.semantic-db.org/k5-stats/

For example, last month's stats:
http://k5.semantic-db.org/k5-stats/2016-03-27--results.sw

comments per month:

diaries per month:

Finally, the results from the first diary slurp:
http://k5.semantic-db.org/first-k5-slurp/

eg, things like "the official k5 dictionary":
http://k5.semantic-db.org/first-k5-slurp/the-official-k5-dictionary.txt

and the k5 frequency list:
http://k5.semantic-db.org/first-k5-slurp/the-k5-frequency-list.txt

The full data from that run (which only included 6,000 diaries) is here:
http://k5.semantic-db.org/first-k5-slurp/corpus/

And that is pretty much all I have!

Though back at my old host I still have the Crawfish archive:
http://crawfish.k5-stats.org/

Enjoy!

ps. I haven't heard back from Sye, so I don't know how much she has. But I think she has some of the section stories (she borrowed my script).


[ Reply ]

3.96
Created by mumble 1 year, 2 months ago
[ 3.96 | 0.00 ] [#4297]

I'm wondering if I should link this in a big-data reddit thread?
It is not every day you get a full copy of 161,942 diaries from a now-defunct web-board,
totaling 96,300 authors and 1,118,588 comments (actually those numbers are a little high due to duplicates, caused by the way I assembled the final sw file from its pieces).


[ Parent | Reply ]


0.00
Created by mirko 1 year, 2 months ago
[ 0.00 | 0.00 ] [#4306]

No, please leave Reddit out of the equation.


[ Parent | Reply ]


1.00
Created by CoyoteBrown 1 year, 2 months ago
[ 1.00 | 0.00 ] [#4313]

Why, may I ask? I'm not defending reddit. It is just that I have read that sentence about five times and I don't understand it. It just comes across as: reddit is uncool or unfit or something, and god forbid we do something uncool like make a tweet or post to reddit.


[ Parent | Reply ]


1.00
Created by mumble 1 year, 2 months ago
[ 1.00 | 0.00 ] [#4307]

The point is, this was quite a bit of work and time on my part.
Perhaps it is useful for others, as a nice big dataset of a web 1.0 webboard.

BTW, good to see you can access your account now.


[ Parent | Reply ]


0.00
Created by CoyoteBrown 1 year, 2 months ago
[ 0.00 | 0.00 ] [#4312]

Didn't someone do something similar with some big dataset? Some years ago, Alta Vista search records or something along those lines. Of course in the media there was some controversy about whether the data could be scrubbed (is that even the right term?) enough to keep it anonymous, and a few people were 'outed' by some embarrassing searches. But overall I think it was very much welcomed. If I'm not mistaken there were some very interesting and important discoveries made. Well, interesting and important to those that study this type of thing. Probably not so much to Joe Sixpack.

Reddit would be a good venue if only because it is big enough that the right people might have a chance to see it: not because they themselves read reddit, but because everybody knows somebody who reads reddit.


[ Parent | Reply ]


3.96
Created by anonymous 1 year, 2 months ago
[ 3.96 | 0.00 ] [#4294]

that's excellent, man.
Mirko


[ Parent | Reply ]


1.00
Created by alevin 1 year, 2 months ago
[ 1.00 | 0.00 ] [#4319]

nice work, man. great to have the archive. may she RIP.


[ Parent | Reply ]


1.00
Created by mumble 1 year, 2 months ago
[ 1.00 | 0.00 ] [#4299]

I found out why my spit-out-sw-stats script didn't work. It seems some non-ASCII text snuck into the sw file.
This was the fix:
$ tr -cd '[:print:]\t\n' < full-k5.sw > clean-full-k5.sw
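
If you want to see where the junk was, rather than just stripping it, a quick Python sketch that reports lines containing bytes outside printable ASCII (apart from tab and newline, which is exactly what the tr call above keeps):

# rough sketch: report lines in full-k5.sw containing bytes the tr call would strip
ok = set(range(0x20, 0x7f)) | {0x09}

with open("full-k5.sw", "rb") as f:
    for lineno, line in enumerate(f, 1):
        bad = sorted({b for b in line.rstrip(b"\r\n") if b not in ok})
        if bad:
            print(lineno, [hex(b) for b in bad])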

Now we have:
$ ./spit-out-sw-stats.sh clean-full-k5.sw
e3143a462c823f291b5cee9812469dd27da46801 *clean-full-k5.sw
(2.0G, 17 op types and 17123667 learn rules)
author: 1919026 learn rules
body: 1919026 learn rules
body-wc: 164811 learn rules
child-comment: 150274 learn rules
date: 164811 learn rules
date-time: 1919026 learn rules
how-many-comments: 164811 learn rules
intro: 164811 learn rules
intro-wc: 164811 learn rules
is-top-level-comment: 1754215 learn rules
parent-comment: 1073744 learn rules
parent-diary: 1754215 learn rules
tags: 53002 learn rules
title: 1919026 learn rules
total-wc: 164811 learn rules
url: 1919026 learn rules
wc: 1754215 learn rules

I also created a full diary index, a table generated with:
$ egrep "^date-time|^author|^title|^how-many-comments" clean-full-k5.sw > diary-index--full-k5.sw

sa: load diary-index--full-k5.sw
sa: table[diary,author,title,how-many-comments,date-time] rel-kets[how-many-comments]

That is here:
http://k5.semantic-db.org/full-k5-diary-index.txt

Heh. Beware the line-wrap :).

Many, many more ways to slice and dice the clean-full-k5.sw file!


[ Parent | Reply ]


0.00
Created by mumble 1 year, 2 months ago
[ 0.00 | 0.00 ] [#4300]

Further:
$ ./spit-out-sw-stats.sh diary-index--full-k5.sw
0dcf61a78ffe0aca079ce9f9f19d00116520bfed *diary-index--full-k5.sw
(435M, 4 op types and 5921889 learn rules)
author: 1919026 learn rules
date-time: 1919026 learn rules
how-many-comments: 164811 learn rules
title: 1919026 learn rules

From which we know there are 164,811 diaries in the data-set and 1,919,026 total posts (diaries + comments), so 1,919,026 - 164,811 = 1,754,215 comments, matching the is-top-level-comment count in the previous comment. As usual, ignore a few duplicates due to overlap in the original sw files.


[ Parent | Reply ]


1.01
Created by United_Fools 1 year, 2 months ago
[ 1.01 | 0.00 ] [#4296]

thank you for preserving all the foolishness!


[ Parent | Reply ]