myelin: blogging ecosystem - dataset


All of this is generated by a Python script that does some crawling, writes a whole heap of statistics, then dumps out all its internal data structures to a file in pickle format.
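If you're curious about the dump step, it's just a standard pickle dump; here's a minimal sketch (the variable names are guesses, not the script's actual internals):

import pickle

# 'blogs' stands in for the spider's internal URL -> Blog mapping
f = open( 'linkData.pickle', 'wb' )
pickle.dump( blogs, f )
f.close()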

Getting going

If you want to mess around with it, get the data (tar.gz format or zip format, ~1 MB) and blog.py.

Now load the data like this:

$ tar -vzxf linkData.tar.gz
$ python
>>> from blog import *
>>> import pickle
>>> blogs = pickle.load( open( 'linkData.pickle', 'rb' ) )

Now you have one great big dict called blogs which contains all the data as Blog objects. It's indexed by URL (slightly munged to merge together things like http://scripting.com/ and http://www.scripting.com).
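The exact munging isn't spelled out here, but the idea is something like this sketch (an assumption on my part; the real script may normalise differently):

from urlparse import urlsplit, urlunsplit

def munge( url ):
    # Hypothetical normalisation: lowercase the host, drop a leading
    # 'www.' and any trailing slash, so that http://scripting.com/ and
    # http://www.scripting.com end up under the same key.
    scheme, host, path, query, frag = urlsplit( url )
    host = host.lower()
    if host.startswith( 'www.' ):
        host = host[4:]
    if path.endswith( '/' ):
        path = path[:-1]
    return urlunsplit( ( scheme, host, path, query, frag ) )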

No idea what to do with it? Start by doing something like:

>>> import pprint
>>> pprint.pprint( blogs.keys() )

Now find your blog:

>>> for url in blogs.keys():
...	if url.find( 'pycs' ) != -1:
...		print url

And get some data about it:

>>> blog = blogs[url]
>>> pprint.pprint( dict( [ ( key, getattr( blog, key ) ) for key in dir( blog ) ] ) )
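For instance, you could rank blogs by how many others link to them. The attribute name 'inbound' below is a guess on my part; check the dir() dump above for whatever the Blog objects really call their link lists:

>>> counts = [ ( len( getattr( b, 'inbound', [] ) ), u ) for u, b in blogs.items() ]
>>> counts.sort()
>>> counts.reverse()
>>> pprint.pprint( counts[:20] )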

Now it's up to you. Take a look at the applications. If you have any trouble, get in touch and I'll see if I can sort you out.

License

You may use and redistribute the ecosystem data available from this page for any purpose, as long as appropriate credit is given. For academic use, "appropriate credit" would be a reference to the Blogging Ecosystem pages in any material based on the data. For web applications, something along the lines of "weblog interconnection data courtesy of the Blogging Ecosystem" at the bottom of pages using the data will do. Ask if you need clarification. Links to Myelin are always appreciated ;-)

- Phil