Feed aggregator

Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft

Slashdot - Thu, 2024-12-12 08:35
Harvard University announced Thursday it's releasing a high-quality dataset of nearly one million public-domain books that could be used by anyone to train large language models and other AI tools. From a report: The dataset was created by Harvard's newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright. Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta's Llama, the Institutional Data Initiative's database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to "level the playing field" by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly-refined and curated content repositories that normally only established tech giants have the resources to assemble. "It's gone through rigorous review," he says. Leppert believes the new public domain database could be used in conjunction with other licensed materials to build artificial intelligence models. "I think about it a bit like the way that Linux has become a foundational operating system for so much of the world," he says, noting that companies would still need to use additional training data to differentiate their models from those of their competitors.

Read more of this story at Slashdot.

Categories: Computer, News

CodeSOD: Ready Xor Not

The Daily WTF - Thu, 2024-12-12 07:30

Phil's company hired a contractor. It was the typical overseas arrangement: bundle up a pile of work, send it off to another timezone, receive back terrible code, push back during code review, then the whole thing blows up when the contracting company pushes back about how while the code review is in the contract if you're going to be such sticklers about it, they'll never deliver, and then management steps in and says, "Just keep the code review to style comments," and then it ends up not mattering anyway because the contractor assigned to the contract leaves for another contracting company, and management opts to use the remaining billable hours for a new feature instead of finishing the inflight work, so you inherit a half-finished pile of trash and somehow have to make it work.

Like I said, pretty standard stuff.

Phil found this construct scattered all over the codebase:

if cond1 and cond2:
    pass
elif cond1 or cond2:
    # do actual work

I hesitate to post this, because what we're looking at is just an attempt at doing a xor operation. And it's not wrong: it's an if-statement way of writing (not a and b) or (a and not b). And if we're being nit-picky, while Python has a xor operator, it's technically a bitwise xor, so I could see someone not connecting that they could use it in this case. cond1 ^ cond2 would work just fine, so long as both conditions are actual booleans. But Python often uses non-boolean values as conditions, like:

text = ""
if text:
    print("This won't print.")

This is playing with truthiness, and the problem here is that you can't use a xor to chain these conditions together.

if text ^ otherText:
    # do stuff

That's a runtime error (a TypeError) when the operands are strings, as str doesn't define ^, which is a bitwise operator on integral types such as int and bool. You'd have to write:

if bool(text) ^ bool(otherText):
    # do stuff

So, would it have been better to use one of the logical equivalences for xor? Certainly. Would it have been even better to turn that equivalence into a function so you could actually call a xor function? Absolutely.
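For illustration, here is a minimal sketch of what such a function might look like. The names xor and original_construct are made up for this example and don't come from Phil's codebase; the second function only mirrors the if/elif pattern above so the two can be compared across the truth table.

```python
def xor(a, b):
    """Logical xor on truthiness: True when exactly one argument is truthy."""
    return bool(a) != bool(b)


def original_construct(cond1, cond2):
    """Mirror of the if/elif pattern; returns True where the 'actual work' would run."""
    if cond1 and cond2:
        return False  # the 'pass' branch: both conditions hold, nothing happens
    elif cond1 or cond2:
        return True   # exactly one condition holds
    return False      # neither condition holds


# Both forms agree on every row of the truth table.
for a in (False, True):
    for b in (False, True):
        assert xor(a, b) == original_construct(a, b)

# Unlike a bare ^, this also accepts non-boolean operands such as strings.
text, other_text = "", "hello"
if xor(text, other_text):
    print("exactly one of the strings is non-empty")
```

Comparing the two bool() results with != gives the same answer as bool(a) ^ bool(b); which spelling reads better is a matter of taste.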

But I also can't complain too much about this one. I hate it, don't get me wrong, but it's not a trainwreck.

Categories: Computer

Amazon Says Developers Spend 'Just One Hour Per Day' on Actual Coding

Slashdot - Thu, 2024-12-12 05:30
An anonymous reader shares a report: Amazon Web Services said in a post earlier this month that developers report spending an average of "just one hour per day" on actual coding. But that doesn't mean these workers twiddle their thumbs the remaining seven hours per day. Instead, developers spend the majority of their time on "tedious, undifferentiated tasks such as learning codebases, writing and reviewing documentation, testing, managing deployments, troubleshooting issues or finding and fixing vulnerabilities," according to Amazon Web Services.

Categories: Computer, News
