v 0.1.2
This is ybsnarfz, a rather simplistic package to snarf the yahoo financial boards
for any given stock.
v 0.1.2
Please run FixTime.pl to fix times between midnight and 1 am.
License:
All files, scripts and programs in this package are available under the terms of the
GPL v2 from the FSF. A copy of that license is in this package (GPLv2.txt).
Dependencies:
perl, obviously. I run 5.8.5, but the scripts should work on somewhat earlier
versions as well.
Getopt::Long package for perl
MySQL. Nothing exotic, except that the scripts use "replace into", which may
require v4 of MySQL.
DBI and DBD::mysql for perl.
wget. This is how data is pulled from yahoo.
linux/posix/unix. Internally there are some system calls to standard posix/unix
commands: cat, ls and mv.
Package Contents:
for v 0.1.0
CHANGELOG - List of changes.
detag.pl - The html scanner program.
FindBadDates.pl - Refetch posts where the date is bad.
FindMissing.pl - Fetch any posts missing from the db (not deleted).
FixTime.pl - Fix bad times between midnight and 1 am.
GPLv2.txt - Text copy of the GPL v2 license.
htdocs/ - directory for php scripts etc
htdocs/ybsnarfz.php - A PHP script to display data.
htdocs/scox-properties.php - A sample properties file.
README - This file.
sample-output/ - some sample output from ybsnarfz.php
scox.properties - Sample properties file.
showMessages.pl - Display (not ready for prime time)
TODO - Things which might get done.
UpdateRecs.pl - Go back and update number of recs.
work/ - working directory to accumulate messages.
yahooGetLast.pl - Scan to find last message number.
yahooMsgScan.pl - The message parser program.
yahooRecsParse.pl - Parse the message list for recs.
yboard.tabdef - The input to mysql to create the messages table.
ybScan.pl - The program which saves messages to the db.
YbsOptions.pm - Header and common code used by other programs.
ypull.pl - The program to pull messages from the yahoo board.
Installation and Operation:
Unpack the tarball: tar -xvzf ybsnarfz-X.X.X.tar.gz. This should create the
directory ybsnarfz-X.X.X (the X's are the version number, very low!). You may
leave the files where they are or move them as desired. The programs expect to
find each other, the properties files and the work directory relative to the
directory they are run from; they also write some transient files there (um,
this should be changed to use a temp directory).
Create a database and a user to access it. Currently it will only build one table.
The example uses "yboard" for the database name, the user and the password; if the
user yboard is not otherwise in use, you can follow the example exactly. Otherwise,
or if paranoid, change the names as desired.
mysqladmin -u<your master> -p<your masterpw> create yboard
mysql -u<your master> -p<your masterpw> mysql
grant all privileges on yboard.* to yboard@localhost identified by 'yboard';
exit;
Create the table.
cd into the install directory
mysql -uyboard -pyboard yboard <yboard.tabdef
Fix the configuration file: For SCOX the scox.properties file may look like this:
domain = Yahoo Finance Board
locus = SCOX
boardid = 1600684464
boardname = cald
workdir = work
dbhost = localhost
dbuser = yboard
dbpass = yboard
dbset = yboard
domain is the primary qualifier. locus is not really used but is there just in case.
boardid and boardname are both required to build the yahoo url that wget fetches. Note
that domain and boardid, along with the message number, form the primary key of the
messages table. workdir is the name of the work directory where pulled posts are stored
until ybScan.pl processes them. The dbxxxx settings are for access to your database.
All options:
boardid   - The numeric id used by yahoo to uniquely identify a
            message board. Can be seen in the url. Required.
boardname - The canonical name of the board in the yahoo url.
            Required.
dbhost    - The hostname to connect to the database. Defaults to
            'localhost'.
dbpass    - The password to connect to the database. Default is
            'yboard'.
dbset     - The database name. Default is 'yboard'.
dbuser    - The database connection id. Default is 'yboard'.
detag     - The html scanner program. Defaults to "detag.pl" (part
            of this package) under the directory specified in
            execdir.
domain    - The domain of messages. This is free text and can be any
            value; it is the same for all rows and is part of the
            primary key, so it is required.
execdir   - Defaults to "." and tells the ybScan program where to
            find the html scanner and message parser. This is
            included so you can move the executables elsewhere.
getlast   - Program to find the last post number (run on output
            catenated from detag). Defaults to "yahooGetLast.pl"
            (part of this package) under execdir.
locus     - Just a name tag in the database for these scans. A
            descriptive name is suggested.
parser    - The message parser program. Defaults to
            "yahooMsgScan.pl" (part of this package) under the
            directory specified in execdir.
puller    - The puller program. ypull.pl under execdir.
scanner   - The message parser program. yahooMsgScan.pl under
            execdir.
tempdir   - Location where ypull writes some temporary files.
workdir   - Location where ypull writes the files of messages
            retrieved from yahoo. ybScan deletes them from here
            after writing them to the DB.
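As an illustration of the "key = value" layout and the defaults listed above, here is a
minimal shell sketch of a lookup with a fallback value. It is not how YbsOptions.pm reads
the file; prop_get is a hypothetical helper, and only the file layout is taken from the
sample scox.properties.

```shell
#!/bin/sh
# prop_get FILE KEY DEFAULT
# Hypothetical helper (not part of ybsnarfz): print the value of KEY from a
# "key = value" properties file, or DEFAULT if the key is absent or empty.
prop_get() {
    file=$1
    key=$2
    default=$3
    value=$(awk -v k="$key" '
        index($0, "=") == 0 { next }            # skip lines without "="
        {
            eq = index($0, "=")                 # split at the first "=" only
            name = substr($0, 1, eq - 1)
            val = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", name)   # trim blanks around the key
            gsub(/^[ \t]+|[ \t]+$/, "", val)    # trim blanks around the value
            if (name == k) { print val; exit }  # first match wins
        }' "$file")
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    else
        printf '%s\n' "$default"
    fi
}
```

With the sample file shown earlier, prop_get scox.properties boardid "" prints
1600684464, and prop_get scox.properties tempdir /tmp falls back to /tmp because the
sample does not set tempdir.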
There are three basic programs, which are run from the directory where they reside:
ypull.pl, ybScan.pl and UpdateRecs.pl. Each requires the prefix of the properties file as
an argument (viz "./ypull.pl scox" for scox.properties). All three accept a -d <number>
option for the debug level. -d 1 will give more info per process; -d 2 will give a lot
more in ybScan... you probably don't want that unless you are fixing a code problem.
ypull.pl optionally takes additional arguments of the first and last message numbers to pull.
A zero (or nothing) for the first number tells it to look in the messages table for the
maximum message number for that domain and board id and restart from there (that last
message will be rescanned). ypull doesn't update the table; it writes each message to the
work directory as <name>-<msgnum>.post, with a magic tag on the top saying what message
number it is.
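The <name>-<msgnum>.post naming means the message number can be recovered from the file
name alone. A small shell sketch, assuming the message number is everything between the
last hyphen and the .post suffix (post_msgnum is a hypothetical helper, not part of the
package):

```shell
#!/bin/sh
# post_msgnum PATH
# Hypothetical helper: print the message number encoded in a file name of
# the form <name>-<msgnum>.post, with or without a leading directory.
post_msgnum() {
    base=${1##*/}                 # strip any directory part
    base=${base%.post}            # strip the .post suffix
    printf '%s\n' "${base##*-}"   # keep what follows the last hyphen
}
```

For example, post_msgnum work/scox-12345.post prints 12345.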
ybScan.pl invokes detag and yahooMsgScan on every file in the work directory whose name
has the properties file name as its prefix and ".post" as its suffix. After a file has
been processed successfully, ybScan deletes it from the work directory.
UpdateRecs.pl will by default go back 2000 records from the most recent and update the number
of recs on posts. A starting message number may be supplied as an optional parameter.
You may wish to set up a simple cron script which looks like (vary for version):
cd ~/ybsnarfz-0.1.0
./ypull.pl scox
./ybScan.pl scox
... and have it run every day, hour or however often you feel like. Similarly you may
wish to run UpdateRecs.pl somewhat less often.
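A slightly more defensive variant of that cron script can be sketched as a shell
function. It assumes that ypull.pl exits with a non-zero status when a pull fails, which
this README does not confirm; run_snarf and the paths in the comments are hypothetical.

```shell
#!/bin/sh
# run_snarf DIR PROP
# Hypothetical wrapper: pull new posts, then scan them into the database.
# The body runs in a subshell, so the caller's working directory is left
# alone. The scan is skipped if the install directory is missing or if
# ypull.pl reports failure (assumed to mean a non-zero exit status).
run_snarf() (
    cd "$1" || exit 1
    ./ypull.pl "$2" || exit 1
    ./ybScan.pl "$2"
)

# Example call from a cron script (path and prefix are placeholders):
#   run_snarf "$HOME/ybsnarfz-0.1.0" scox
```

The programs are still run from their own directory, as they expect; the only change from
the plain script is that ybScan.pl is not started after a failed pull.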
The Messages Table:
The full text of this definition is in yboard.tabdef.
domain    varchar(64)   # A descriptive name for the domain.
locus     varchar(64)   # A descriptive name for the particular part of domain.
boardid   varchar(64)   # A unique identifier in the domain.
msgn      bigint(20)    # The message number, unique in domain and boardid.
poster    varchar(255)  # Who sent the message.
posttime  datetime      # Date and time of posting.
recs      int(11)       # How well regarded was this message.
title     varchar(255)  # Subject line of the message.
refmsg    bigint(20)    # Parent message number.
refby     varchar(255)  # Parent message poster.
msg       blob          # Text of the message.
PRIMARY KEY (domain,boardid,msgn),
KEY msgn (msgn),
KEY poster (poster),
KEY refmsg (refmsg),
KEY postTime (postTime)
Usage Notes:
sh-3.00$
sh-3.00$ ./ypull.pl -h
usage is [path/]ypull.pl [-h][-d n] <prop> [start] [end]
This program pulls messages from the Yahoo Finance board and
saves them for further processing.
A -h (or no arguments) will print this message.
A -d n will turn on debugging if n is not zero.
<prop> is the prefix name of a .properties file.
"start" is the first message number; if zero or not specified
ypull will find the last message in the database and restart
from there.
"end" is the last message number to pull. If unspecified or
zero ypull will find the last message number currently
available on the board and use that.
sh-3.00$
sh-3.00$
sh-3.00$ ./ybScan.pl -h
usage is [path/]ybScan.pl [-h][-d n] <prop>
This program takes Yahoo Finance Board posts as captured by
ypull.pl, parses out the information and saves it in a
database table, removing the posts from the archive created
by ypull.
-h (or no arguments) causes this message to print.
-d n will display some or lots of debug information if n is 1 or
higher.
<prop> is the prefix of a .properties files specifying how to
handle this information.
sh-3.00$
sh-3.00$
sh-3.00$ ./UpdateRecs.pl
usage is [path/]UpdateRecs.pl [-h][-d n] <prop> [start]
This program updates the number of recs in the messages table.
saves them for further processing.
A -h (or no arguments) will print this message.
A -d n will turn on debugging if n is not zero.
<prop> is the prefix name of a .properties file.
"start" is the first message number; if zero or not specified
this will find the last message in the database and start
from 2000 back.
sh-3.00$
sh-3.00$
ybsnarfz.php
This is a php script to display the data in useful ways. It depends on a configuration
file (<name>-properties.php) being in the same directory as the php script and readable by
apache (or whatever web server you run). This configuration is identical to the .properties
file used by the perl programs, save for the syntax changes needed for PHP. The script
displays messages in both list and threaded modes.
xybsnarfz.php
Very similar to ybsnarfz.php, but it truncates the display of each message at 600
characters and includes an iframe back to the yahoo board.
Gotchas:
The primary gotcha in this system is that it does not preserve runs of blank lines. More
than one blank line in the input will always be reduced to a single blank line. That could
be fixed... do I want to?
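To see what that reduction does to a post, the effect can be reproduced with a short awk
filter. This is only a demonstration of the behaviour described above, not the code the
perl scripts use; squeeze_blank is a hypothetical name, and treating whitespace-only lines
as blank is an assumption.

```shell
#!/bin/sh
# squeeze_blank
# Hypothetical filter: copy stdin to stdout, collapsing each run of blank
# lines into a single empty line. Lines holding only spaces or tabs count
# as blank here (awk sees no fields on them, so NF is 0).
squeeze_blank() {
    awk '
        NF == 0 { if (!blank) print ""; blank = 1; next }  # first blank of a run only
                { blank = 0; print }                       # ordinary line
    '
}
```

Feeding it a line "a", three empty lines and a line "b" yields "a", one empty line and "b".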
-- TWZ