Title: [CHART] Press Section Statistics
Post by: kiko on April 28, 2013, 09:28:12 AM
Getting to grips with gnuplot has been on my to-do list for a while, so I pulled down some data from this very section. Here are the results.
https://i.imgur.com/cMW1P6N.png
You can see 4 clear media cycles in the last year. My only question is: how big is the next one going to be? :o
For any Linux geeks who want to play:
press_scraper.sh
#!/bin/bash
# press_scraper.sh - scrape and collate bitcoin press articles, output csv.
# usage - ./press_scraper.sh
# This program is free software. It comes without any warranty, to
# the extent permitted by applicable law. You can redistribute it
# and/or modify it under the terms of the Do What The Fuck You Want
# To Public License, Version 2, as published by Sam Hocevar. See
# http://sam.zoy.org/wtfpl/COPYING for more details.
total_articles=1760
decrement=40
tempfile=$(mktemp)
outfile=press_articles.csv
[ -f $tempfile ] || { echo "Error: Could not make temporary file. Exiting..."; \
    exit 1 ; }
function scrape { curl "$1" | sed -rn 's#.*<span id="msg_[0-9]+"><a href="https://bitcointalk\.org/index\.php\?topic=[0-9]+(\.0)?">([0-9]{4}-[0-1][0-9]-[0-3][0-9]).*</a></span>#\2#p' ; }
for ((x=total_articles; x>40; x-=decrement))
do
    scrape "https://bitcointalk.org/index.php?board=77.$x" >> $tempfile
    sleep 5 # This is here just to be kind to the server, remove for speedup.
done
scrape "https://bitcointalk.org/index.php?board=77" >> $tempfile
sort $tempfile | uniq -c | sed -r 's/^ *([0-9]+) (.*)/\1,\2/' >$outfile
gnuplot_commands
reset
clear
set xdata time
set format x "%Y-%m-%d"
set timefmt "%Y-%m-%d"
set datafile separator ","
set style fill solid noborder
set xtics rotate by -90 out nomirror 604800
set ytics out nomirror
set grid ytics
set ylabel "Press hits/day"
set xrange ["2012-04-07":"2013-04-26"]
set yrange [0:*]
set boxwidth 43200 absolute
set datafile separator ","
set term pngcairo truecolor font "Arial,11" size 1200,1200
set output "press_hits.png"
plot "press_articles.csv" using 2:1 with boxes ti "Press Article Frequency" lt 1 linecolor rgb "#FF0000"
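For anyone who wants to see what the two text-processing steps do without hitting the server, here is a minimal offline sketch. It reuses the sed patterns from the script above (with `-E` instead of `-r`, which should be equivalent on GNU sed). The function names `extract_dates`, `collate` and `sample_page` are made up for illustration, and the sample markup is hypothetical, shaped only to match the pattern rather than copied from a real board page.

```shell
# extract_dates: print the YYYY-MM-DD prefix of each topic title on stdin.
# Lines whose link text does not start with a date are dropped.
extract_dates() {
    sed -En 's#.*<span id="msg_[0-9]+"><a href="https://bitcointalk\.org/index\.php\?topic=[0-9]+(\.0)?">([0-9]{4}-[0-1][0-9]-[0-3][0-9]).*</a></span>#\2#p'
}

# collate: turn a list of dates into "count,date" CSV rows.
collate() {
    sort | uniq -c | sed -E 's/^ *([0-9]+) (.*)/\1,\2/'
}

# sample_page: hypothetical markup standing in for one board page.
sample_page() {
    cat <<'EOF'
<td><span id="msg_1"><a href="https://bitcointalk.org/index.php?topic=101.0">2013-04-02 Example Site - First headline</a></span>
<td><span id="msg_2"><a href="https://bitcointalk.org/index.php?topic=102.0">2013-04-01 Example Site - Second headline</a></span>
<td><span id="msg_3"><a href="https://bitcointalk.org/index.php?topic=103.0">Read this before posting</a></span>
<td><span id="msg_4"><a href="https://bitcointalk.org/index.php?topic=104">2013-04-02 Other Site - Third headline</a></span>
EOF
}

sample_page | extract_dates | collate
```

With the sample above this prints `1,2013-04-01` and `2,2013-04-02`, which is the `count,date` layout the gnuplot commands read with `using 2:1`.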
Title: Re: [CHART] Press Section Statistics
Post by: grondilu on April 28, 2013, 11:09:03 AM
Nice script.
My first advice: don't use tempfiles. They always mess up your directory, as you always forget to remove them. Just make proper Unix pipes: read from stdin, write to stdout.
#!/bin/bash
total_articles=1760
decrement=40
function scrape { curl "$1" | sed -rn 's#.*<span id="msg_[0-9]+"><a href="https://bitcointalk\.org/index\.php\?topic=[0-9]+(\.0)?">([0-9]{4}-[0-1][0-9]-[0-3][0-9]).*</a></span>#\2#p' ; }
{
    for ((x=total_articles; x>40; x-=decrement))
    do
        scrape "https://bitcointalk.org/index.php?board=77.$x"
        sleep 5 # This is here just to be kind to the server, remove for speedup.
    done
    scrape "https://bitcointalk.org/index.php?board=77"
} | sort | uniq -c | sed -r 's/^ *([0-9]+) (.*)/\1,\2/'
Not tested yet, but this should work as well as your initial code.
Update: Second advice: provide your parameters as arguments to your script, with default values:
total_articles="${1:-1760}"
decrement="${2:-40}"
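A small sketch of the default-argument idea, separated from the scraping so it can be tried offline. The function name `page_offsets` and the variable `step` are hypothetical. One thing worth noting: as far as I can tell, the `x>40` condition in the loops above stops at offset 80, so offset 40 is never requested before the final `board=77` call. The sketch below walks all the way down to 0, which is an assumption about the intent, not something confirmed in the thread.

```shell
# page_offsets: print the board offsets to fetch, highest first, ending at 0.
# Both arguments are optional and fall back to the thread's values.
page_offsets() {
    total="${1:-1760}"   # total number of articles (default from the thread)
    step="${2:-40}"      # articles per page (default from the thread)
    x="$total"
    while [ "$x" -gt 0 ]; do
        echo "$x"
        x=$((x - step))
    done
    echo 0               # the first page of the board
}

# Usage: explicit values, then the defaults.
page_offsets 160 40
page_offsets | wc -l
```

Each printed offset could then be appended to the board URL inside the `{ ...; } | sort | uniq -c` group shown above.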
Title: Re: [CHART] Press Section Statistics
Post by: kiko on April 28, 2013, 12:00:02 PM
Cool, thanks for the tips. My /tmp is just a ramdisk so I tend to abuse it.
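If a temporary file is still wanted somewhere, one common way to stop it lingering is an EXIT trap that removes it. This is only a sketch of that idiom, not something either script in this thread does; `count_unique` and `scratch` are made-up names, and it assumes a `mktemp` that works without a template argument (GNU coreutils does).

```shell
# count_unique: count distinct lines on stdin via a scratch file.
# The body runs in a subshell, so the EXIT trap fires when the function
# finishes and removes the scratch file even if a command in between fails.
count_unique() (
    scratch=$(mktemp) || exit 1
    trap 'rm -f "$scratch"' EXIT
    cat > "$scratch"
    sort -u "$scratch" | wc -l | tr -d ' '
)

# Usage: three lines, two of them identical.
printf '2013-04-02\n2013-04-01\n2013-04-02\n' | count_unique
```

The pipe-only version above is still simpler; the trap just makes the tempfile habit less messy.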
Title: Re: [CHART] Press Section Statistics
Post by: Grinder on April 28, 2013, 03:20:20 PM
It's hard to get a good impression with so many long thin lines. I suggest plotting weekly sums or weekly moving average instead.
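A rough sketch of the weekly-sums idea, working on the `count,date` rows that the scraper writes. Everything here is an assumption for illustration: the function name `weekly_sums`, the choice of Monday as the start of the week, and labelling each week with the earliest date that actually appears in it (days with no articles are absent from the CSV, so that is not always the Monday). The day-number arithmetic is the usual civil-calendar formula, done in plain awk to avoid depending on GNU `date`.

```shell
# weekly_sums: read "count,YYYY-MM-DD" rows on stdin and print
# "weekly_total,first_date_seen_in_that_week", oldest week first.
weekly_sums() {
    sort -t, -k2,2 | awk -F, '
    # Days since 1970-01-01 (a Thursday) for a Gregorian date.
    function days(y, m, d,    era, yoe, doy) {
        if (m <= 2) y -= 1
        era = int(y / 400)
        yoe = y - era * 400
        doy = int((153 * (m > 2 ? m - 3 : m + 9) + 2) / 5) + d - 1
        return era * 146097 + yoe * 365 + int(yoe / 4) - int(yoe / 100) + doy - 719468
    }
    {
        split($2, p, "-")
        # Adding 3 shifts the bucket boundary from Thursday to Monday.
        week = int((days(p[1] + 0, p[2] + 0, p[3] + 0) + 3) / 7)
        if (!(week in sum)) { order[++n] = week; first[week] = $2 }
        sum[week] += $1
    }
    END { for (i = 1; i <= n; i++) printf "%d,%s\n", sum[order[i]], first[order[i]] }'
}

# Usage: two rows in the week of Monday 2013-04-15, two in the week of 2013-04-22.
printf '3,2013-04-15\n2,2013-04-21\n4,2013-04-22\n1,2013-04-28\n' | weekly_sums
```

This prints `5,2013-04-15` and `5,2013-04-22`. The output keeps the same two-column layout, so it should be usable with the existing `using 2:1` plot line, though the box width would presumably need widening from half a day to something closer to a week.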
Title: Re: [CHART] Press Section Statistics
Post by: kiko on April 28, 2013, 04:42:31 PM
Quote from: Grinder on April 28, 2013, 03:20:20 PM
It's hard to get a good impression with so many long thin lines. I suggest plotting weekly sums or weekly moving average instead.
Version with curve-fitted bezier trendline: https://i.imgur.com/oWyDKdR.png