Downloading your PDFs from CiteULike using python and selenium

CiteULike, an online bibliography manager, is unfortunately shutting down. They have a service to export your bibliography as a BibTeX file, but this does not include the PDFs you have uploaded to the site. Having web access to the PDFs is one of the main reasons I liked CiteULike (along with the tag cloud).

I have too many PDFs to download manually (over 2,000), so I wrote a Python script to download them. Unlike prior scraping examples I've written about, you need to be signed into your CiteULike account to download the files, so I use the selenium library to mimic what you would normally do in a web browser.
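The login-then-download flow can be sketched as below. Since the site is now offline, the CiteULike login URL, form-field names, and link text here are all assumptions for illustration, not the exact code from my script. One common trick shown here is copying the browser's session cookies into a requests session so the PDF files can be fetched directly.

```python
import os
import re


def safe_filename(title, ext=".pdf"):
    """Turn an article title into a filesystem-safe file name."""
    clean = re.sub(r"[^\w\- ]", "", title).strip().replace(" ", "_")
    return (clean[:80] or "untitled") + ext


def download_pdfs(username, password, article_urls, out_dir="pdfs"):
    """Sign in with selenium, then save the PDF attached to each article page.

    Imports are deferred so the pure helper above can be used without
    selenium or a browser installed.
    """
    import requests
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    os.makedirs(out_dir, exist_ok=True)
    driver = webdriver.Firefox()
    try:
        # Hypothetical login page and form-field names.
        driver.get("http://www.citeulike.org/login")
        driver.find_element(By.NAME, "username").send_keys(username)
        pw_box = driver.find_element(By.NAME, "password")
        pw_box.send_keys(password)
        pw_box.submit()

        # Reuse the signed-in browser's cookies for the file downloads.
        session = requests.Session()
        for c in driver.get_cookies():
            session.cookies.set(c["name"], c["value"])

        for url in article_urls:
            driver.get(url)
            # Hypothetical link text for the attachment.
            link = driver.find_element(By.PARTIAL_LINK_TEXT, "pdf")
            resp = session.get(link.get_attribute("href"))
            fname = safe_filename(driver.title)
            with open(os.path.join(out_dir, fname), "wb") as f:
                f.write(resp.content)
    finally:
        driver.quit()
```

The same skeleton works for any members-only site: log in once with the real browser, then let requests do the bulk fetching with the borrowed cookies.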

So let me know what bibliography manager I should switch to. One of the main factors will be whether I can automate the conversion, including the PDFs (even if that just means pointing to where each PDF is stored on my local machine).

This is a good tutorial to know even if you have nothing to do with CiteULike. Various web services require you to sign in or mimic a browser like this to download data repeatedly. For example, a police department (PD) may have a system where you input a set of dates to get back crime incidents, and it limits the number of records returned, so you need to submit many small queries to get a full sample. The selenium library can be used in a similar fashion to this tutorial in that circumstance.
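A minimal sketch of that repeated date-range pattern: split a long period into short chunks so each request stays under the record cap. The portal URL and form-field names would depend on the site, so only the chunking logic (pure Python) and a hypothetical selenium call are shown.

```python
from datetime import date, timedelta


def date_chunks(start, end, days=7):
    """Yield (chunk_start, chunk_end) pairs covering [start, end] inclusive."""
    cur = start
    step = timedelta(days=days - 1)
    while cur <= end:
        chunk_end = min(cur + step, end)
        yield cur, chunk_end
        cur = chunk_end + timedelta(days=1)


# Break January 2019 into week-long queries.
chunks = list(date_chunks(date(2019, 1, 1), date(2019, 1, 31), days=7))

# Each pair would then be typed into the portal's form via selenium,
# e.g. (field names hypothetical):
# driver.find_element(By.NAME, "start_date").send_keys(str(chunk_start))
# driver.find_element(By.NAME, "end_date").send_keys(str(chunk_end))
```

Looping the chunks through the form and appending each batch of results gives you the full sample the portal would not return in one query.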


9 Comments

  1. It also appears that there is an API script using mechanize (and a CiteULike API I was not familiar with): https://github.com/AceCentre/ace-search-engine/blob/master/scripts/citeusync.py.

    Via Will Wade on the CiteULike discussion forums, http://www.citeulike.org/groupforum/4546?highlight=57847#msg_57847

  2. B / March 5, 2019

    Does your method also download the tags? I too am a heavy user of the “tag cloud” and would like to download my entire library *including tags* and attachments (PDFs mostly). Many thanks.

    • You can get those when you download the bibtex file, they are plopped into the keywords field.

      • B / March 6, 2019

        Many thanks. I had searched my .bib for certain keywords and didn’t find them, probably this was an export or other user error on my part. I redid it and now see the tags in the keywords field, just as you described.

  3. Louis Guilbault / May 5, 2019

    I have been using CiteULike for the past 10 years or so. I just realized this week (May 3rd, 2019) that CiteULike closed recently. Is it still possible to download all of my references that were on CiteULike? If so, could you please let me know how to proceed. Thanks, Louis

  4. B / May 5, 2019

    Louis, the citeulike.org website is currently unreachable for me in a browser, which does not bode well. However, still worth trying to export your data using scripts.

    In case you are not successful using Andrew Wheeler’s neat script from this page, you may also wish to try another Python script (from Will Wade) that worked for me. It can handle cases where multiple PDFs are attached to a single CiteULike entry. You’ll need to pass your username and password as command-line arguments:

    https://github.com/AceCentre/ace-search-engine/blob/master/scripts/citeusync.py

    citeusync.py -u yourusername -p yourpassword

  5. Hi, I think I am quite late, but I just realised that CiteULike has been discontinued… I tried the citeusync script and it doesn’t work, as it throws this error:
    > urllib2.URLError:

    I guess that this is because the website/server/database is not reachable any more.

    Do you know if the CiteULike owners can be contacted in any way?

Pingback: Web scraping police data using selenium and python | Andrew Wheeler
