Comment [#4299]

Created by mumble 2 years, 4 months ago
[ 1.00 | 0.00 ] [#4299]

I found out why my spit-out-sw-stats.sh script didn't work: some non-ASCII text had snuck into the sw file.
This was the fix:
$ tr -cd '[:print:]\t\n' < full-k5.sw > clean-full-k5.sw
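
A quick way to sanity-check a cleanup like this is to compare byte counts before and after. The sketch below runs the same tr filter on a tiny made-up file; the file names and the sample learn rules are assumptions for illustration only, and LC_ALL=C is set so that tr works byte by byte:

```shell
# Hypothetical sample: one 2-byte UTF-8 character and one control byte.
tmp=$(mktemp -d)
printf 'title |a> => |caf\303\251>\nbody |a> => |ok\001>\n' > "$tmp/sample.sw"

# Same filter as above: keep printable ASCII, tabs and newlines only.
LC_ALL=C tr -cd '[:print:]\t\n' < "$tmp/sample.sw" > "$tmp/clean-sample.sw"

# Compare sizes to see how many bytes were dropped.
before=$(wc -c < "$tmp/sample.sw")
after=$(wc -c < "$tmp/clean-sample.sw")
echo "removed $((before - after)) bytes"
```

Note the filter drops control characters as well as non-ASCII bytes, so it is worth checking that the number removed is small compared to the file size.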

Now we have:
$ ./spit-out-sw-stats.sh clean-full-k5.sw
e3143a462c823f291b5cee9812469dd27da46801 *clean-full-k5.sw
(2.0G, 17 op types and 17123667 learn rules)
author: 1919026 learn rules
body: 1919026 learn rules
body-wc: 164811 learn rules
child-comment: 150274 learn rules
date: 164811 learn rules
date-time: 1919026 learn rules
how-many-comments: 164811 learn rules
intro: 164811 learn rules
intro-wc: 164811 learn rules
is-top-level-comment: 1754215 learn rules
parent-comment: 1073744 learn rules
parent-diary: 1754215 learn rules
tags: 53002 learn rules
title: 1919026 learn rules
total-wc: 164811 learn rules
url: 1919026 learn rules
wc: 1754215 learn rules
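
For anyone wanting similar per-operator counts without the script, here is a rough sketch. It is not the actual spit-out-sw-stats.sh; it assumes every learn rule sits on one line of the form "op |ket> => ...", with the operator as the first whitespace-separated field. The function name and sample data are made up:

```shell
# Count learn rules per operator (first field of every line containing " => ").
count_learn_rules() {
    awk '/ => / { n[$1]++ }
         END { for (op in n) printf "%s: %d learn rules\n", op, n[op] }' "$1" |
        LC_ALL=C sort
}

# Tiny hypothetical sw file to try it on.
tmp=$(mktemp -d)
printf '%s\n' \
    'author |diary: 1> => |mumble>' \
    'title |diary: 1> => |hello>' \
    'author |comment: 2> => |someone>' > "$tmp/sample.sw"

count_learn_rules "$tmp/sample.sw"
```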

I also created a full diary index, a table generated with:
$ egrep "^date-time|^author|^title|^how-many-comments" clean-full-k5.sw > diary-index--full-k5.sw
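
A slightly stricter variant, in case operator names ever share a prefix (a bare ^date would also match date-time, for example). This is only a sketch: it assumes the operator is followed by a single space, and the sample rules are made up:

```shell
# Hypothetical sample with two operators that share a prefix.
tmp=$(mktemp -d)
printf '%s\n' \
    'date |diary: 1> => |2015-01-01>' \
    'date-time |diary: 1> => |2015-01-01 10:00>' \
    'author |diary: 1> => |mumble>' \
    'body |diary: 1> => |some text>' > "$tmp/sample.sw"

# Group the alternatives and require a space after the operator name,
# so only exact operator matches are kept.
pattern='^(date-time|author|title|how-many-comments) '
grep -E "$pattern" "$tmp/sample.sw" > "$tmp/index.sw"

kept=$(grep -cE "$pattern" "$tmp/sample.sw")
echo "kept $kept of 4 rules"
```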

sa: load diary-index--full-k5.sw
sa: table[diary,author,title,how-many-comments,date-time] rel-kets[how-many-comments]

That is here:
http://k5.semantic-db.org/full-k5-diary-index.txt

Heh. Beware the line-wrap :).

Many, many more ways to slice and dice the clean-full-k5.sw file!




Created by mumble 2 years, 4 months ago
[ 0.00 | 0.00 ] [#4300]

Further:
$ ./spit-out-sw-stats.sh diary-index--full-k5.sw
0dcf61a78ffe0aca079ce9f9f19d00116520bfed *diary-index--full-k5.sw
(435M, 4 op types and 5921889 learn rules)
author: 1919026 learn rules
date-time: 1919026 learn rules
how-many-comments: 164811 learn rules
title: 1919026 learn rules

From this we know there are 164,811 diaries in the data set and 1,919,026 total posts (diaries + comments), ignoring a few duplicates caused by overlap in the original sw files.
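
The numbers are consistent with each other. A small check using only the counts listed in these two comments (the variable names are just for illustration):

```shell
posts=1919026     # author / title / date-time rules: one per post
diaries=164811    # how-many-comments rules: one per diary

# Posts that are not diaries should be comments.
comments=$((posts - diaries))
echo "comments: $comments"

# The index holds three per-post operators plus one per-diary operator.
index_rules=$((3 * posts + diaries))
echo "index learn rules: $index_rules"
```

That gives 1754215 comments, the same as the is-top-level-comment, parent-diary and wc counts in the full file, and 5921889 learn rules, the same as the total reported for the index.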

